Sequence variant analysis of cell-free DNA for cancer screening

ABSTRACT

A frequency of somatic mutations in a biological sample (e.g., plasma or serum) of a subject undergoing screening or monitoring for cancer can be compared with that in the constitutional DNA of the same subject. A parameter can be derived from these frequencies and used to determine a classification of a level of cancer. False positives can be filtered out by requiring any variant locus to have at least a specified number of variant sequence reads (tags), thereby providing a more accurate parameter. The relative frequencies for different variant loci can be analyzed to determine a level of heterogeneity of tumors in a patient.

BACKGROUND

It has been shown that tumor-derived DNA is present in the cell-free plasma/serum of cancer patients (Chen X Q et al. Nat Med 1996; 2: 1033-1035). Most current methods are based on the direct analysis of mutations known to be associated with cancer (Diehl F et al. Proc Natl Acad Sci 2005; 102: 16368-16373; Forshew T et al. Sci Transl Med 2012; 4: 136ra68). Another method has investigated cancer-associated copy number variations detected by random sequencing of plasma DNA (U.S. Patent Publication 2013/0040824 by Lo et al.).

It is known that, with time, more than one cancer cell would acquire a growth advantage and produce multiple clones of daughter cells. Ultimately, the tumorous growth and/or its metastatic foci would contain a conglomerate of groups of clonal cancer cells. This phenomenon is typically referred to as tumor heterogeneity (Gerlinger M et al. N Engl J Med 2012; 366: 883-892; Yap T A et al. Sci Transl Med 2012; 4: 127ps10).

Cancers are known to be highly heterogeneous, i.e., the mutation profiles of cancers of the same tissue type can vary widely. Therefore, the direct analysis of specific mutations can typically detect only a subset of the cases within a particular cancer type known to be associated with those specific mutations. Additionally, tumor-derived DNA is usually the minor species of DNA in human plasma, and the absolute concentration of DNA in plasma is low. Therefore, the direct detection of one or a small group of cancer-associated mutations in plasma or serum may achieve low analytical sensitivity even among patients with cancers known to be harboring the targeted mutations. Furthermore, it has been shown that there is significant intratumoral heterogeneity in terms of mutations even within a single tumor. The mutations can be found in only a subpopulation of the tumor cells. The difference in the mutational profiles between the primary tumor and the metastatic lesions is even bigger. One example of intratumoral and primary-metastasis heterogeneity involves the KRAS, BRAF and PIK3CA genes in patients suffering from colorectal cancers (Baldus et al. Clin Cancer Research 2010; 16: 790-9).

In a scenario in which a patient has a primary tumor (carrying a KRAS mutation but not a PIK3CA mutation) and a concealed metastatic lesion (carrying a PIK3CA mutation but not a KRAS mutation), if one focused on the detection of the KRAS mutation in the primary tumor, the concealed metastatic lesion could not be detected. However, if one included both mutations in the analysis, both the primary tumor and the concealed metastatic lesion could be detected. Hence, the test involving both mutations would have a higher sensitivity in the detection of residual tumor tissues. Such a simple example becomes more complex when one is screening for cancer and has little or no clue of the types of mutations that might occur.

It is therefore desirable to provide new techniques to perform a broad screening, detection, or assessment for cancer.

SUMMARY

Embodiments can observe a frequency of somatic mutations in a biological sample (e.g., plasma or serum) of a subject undergoing screening or monitoring for cancer, when compared with that in the constitutional DNA of the same subject. Random sequencing can be used to determine these frequencies. A parameter can be derived from these frequencies and used to determine a classification of a level of cancer. False positives can be filtered out by requiring any variant locus to have at least a specified number of variant sequence reads (tags), thereby providing a more accurate parameter. The relative frequencies for different variant loci can be analyzed to determine a level of heterogeneity of tumors in a patient.

In one embodiment, the parameter can be compared with the same parameter derived from a group of subjects without cancer, or with a low risk of cancer. A significant difference in the parameter obtained from the test subject and that from the group of subjects without cancer, or with a low risk of cancer, can indicate an increased risk that the test subject has cancer or a premalignant condition or would develop cancer in the future. Thus, in one embodiment, plasma DNA analysis can be conducted without prior genomic information of the tumor. Such an embodiment is thus especially useful for the screening of cancer.

In another embodiment, the analysis can also be used for monitoring a cancer patient following treatment, to see if there is residual tumor or if the tumor has relapsed. For example, a patient with residual tumor or in whom the tumor has relapsed would have a higher frequency of somatic mutations than one in whom there is no residual tumor or in whom no tumor relapse is observed. The monitoring can involve obtaining samples from a cancer patient at multiple time points following treatment for ascertaining the temporal variations of tumor-associated genetic aberrations in bodily fluids or other samples with cell-free nucleic acids, e.g., plasma or serum.

According to one embodiment, a method detects cancer or premalignant change in a subject. A constitutional genome of the subject is obtained. One or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject, where the biological sample includes cell-free DNA. Genomic positions are determined for the sequence tags. The sequence tags are compared to the constitutional genome to determine a first number of first loci. At each first locus, a number of the sequence tags having a sequence variant relative to the constitutional genome is above a cutoff value, where the cutoff value is greater than one. A parameter is determined based on a count of sequence tags having a sequence variant at the first loci. The parameter is compared to a threshold value to determine a classification of a level of cancer in the subject.

According to another embodiment, a method analyzes a heterogeneity of one or more tumors of a subject. A constitutional genome of the subject is obtained. One or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject, where the biological sample includes cell-free DNA. Genomic positions are determined for the sequence tags. The sequence tags are compared to the constitutional genome to determine a first number of first loci. At each first locus, a number of the sequence tags having a sequence variant relative to the constitutional genome is above a cutoff value, where the cutoff value is greater than one. A measure of heterogeneity of the one or more tumors is calculated based on the respective first numbers of the set of first genomic locations.

According to another embodiment, a method determines a fractional concentration of tumor DNA in a biological sample including cell-free DNA. One or more sequence tags are received for each of a plurality of DNA fragments in the biological sample. Genomic positions are determined for the sequence tags. For each of a plurality of genomic regions, a respective amount of DNA fragments within the genomic region is determined from sequence tags having a genomic position within the genomic region. The respective amount is normalized to obtain a respective density. The respective density is compared to a reference density to identify whether the genomic region exhibits a 1-copy loss or a 1-copy gain. A first density is calculated from respective densities identified as exhibiting a 1-copy loss or from respective densities identified as exhibiting a 1-copy gain. The fractional concentration is calculated by comparing the first density to another density to obtain a differential, wherein the differential is normalized with the reference density.

Other embodiments are directed to systems and computer readable media associated with methods described herein.

A better understanding of the nature and advantages of the present invention may be gained with reference to the following detailed description and the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a method 100 for detecting cancer or premalignant change in a subject according to embodiments of the present invention.

FIG. 2 shows a flowchart of a method comparing the sample genome (SG) directly to the constitutional genome (CG) according to embodiments of the present invention.

FIG. 3 shows a flowchart of a method 300 comparing the sample genome (SG) to the constitutional genome (CG) using the reference genome (RG) according to embodiments of the present invention.

FIG. 4 is a table 400 showing the number of cancer-associated single nucleotide mutations correctly identified using different numbers of occurrences as the criterion for classifying a mutation as being present in the sample according to embodiments of the present invention when the fractional concentration of tumor-derived DNA in the sample is assumed to be 10%.

FIG. 5 is a table showing the expected number of false-positive loci and the expected number of mutations identified when the fractional concentration of tumor-derived DNA in the sample is assumed to be 5%.

FIG. 6A is a graph 600 showing the detection rate of cancer-associated mutations in plasma with 10% and 20% plasma fractional concentrations of tumor-derived DNA and using four and six occurrences (r) as criteria for calling potential cancer-associated mutations. FIG. 6B is a graph 650 showing the expected number of nucleotide positions falsely classified as having a nucleotide change using criteria of occurrence (r) of 4, 5, 6 and 7 vs. sequencing depth.

FIG. 7A is a graph 700 showing the number of true cancer-associated mutation sites and false-positive sites with different sequencing depths when the fractional concentration of tumor-derived DNA in the sample is assumed to be 5%. FIG. 7B is a graph 750 showing the predicted number of false-positive sites involving the analysis of the whole genome (WG) and all exons.

FIG. 8 is a table 800 showing results for 4 HCC patients before and after treatment, including fractional concentrations of tumor-derived DNA in plasma according to embodiments of the present invention.

FIG. 9 is a table 900 showing detection of the HCC-associated SNVs in 16 healthy control subjects according to embodiments of the present invention.

FIG. 10A shows a distribution plot of the sequence read densities of the tumor sample of an HCC patient according to embodiments of the present invention. FIG. 10B shows a distribution plot 1050 of z-scores for all the bins in the plasma of an HCC patient according to embodiments of the present invention.

FIG. 11 shows a distribution plot 1100 of z-scores for the plasma of an HCC patient according to embodiments of the present invention.

FIG. 12 is a flowchart of a method 1200 of determining a fractional concentration of tumor DNA in a biological sample including cell-free DNA according to embodiments of the present invention.

FIG. 13A shows a table 1300 of the analysis of mutations in the plasma of the patient with ovarian cancers and a breast cancer at the time of diagnosis according to embodiments of the present invention.

FIG. 13B shows a table 1350 of the analysis of mutations in the plasma of the patient with bilateral ovarian cancers and a breast cancer after tumor resection according to embodiments of the present invention.

FIG. 14A is a table 1400 showing detection of single nucleotide variations in plasma DNA for HCC1. FIG. 14B is a table 1450 showing detection of single nucleotide variations in plasma DNA for HCC2.

FIG. 15A is a table 1500 showing detection of single nucleotide variations in plasma DNA for HCC3. FIG. 15B is a table 1550 showing detection of single nucleotide variations in plasma DNA for HCC4.

FIG. 16 is a table 1600 showing detection of single nucleotide variations in plasma DNA for the patient with ovarian (and breast) cancer.

FIG. 17 is a table 1700 showing the predicted sensitivities of different requirements of occurrence and sequencing depths.

FIG. 18 is a table 1800 showing the predicted numbers of false positive loci for different cutoffs and different sequencing depths.

FIG. 19 shows a tree diagram illustrating the number of mutations detected in the different tumor sites.

FIG. 20 is a table 2000 showing the number of fragments carrying the tumor-derived mutations in the pre-treatment and post-treatment plasma sample.

FIG. 21 is a graph 2100 showing distributions of occurrence in plasma for the mutations detected in a single tumor site and mutations detected in all four tumor sites.

FIG. 22 is a graph 2200 showing predicted distribution of occurrence in plasma for mutations coming from a heterogeneous tumor.

FIG. 23 demonstrates the specificity of embodiments for 16 healthy control subjects.

FIG. 24 is a flowchart of a method 2400 for analyzing a heterogeneity of one or more tumors of a subject according to embodiments of the present invention.

FIG. 25 shows a block diagram of an example computer system 2500 usable with system and methods according to embodiments of the present invention.

DEFINITIONS

As used herein, the term “locus” or its plural form “loci” is a location or address of any length of nucleotides (or base pairs) which may have a variation across genomes. A “bin” is a region of predetermined length in a genome. A plurality of bins may have a same first length (resolution), while a different plurality can have a same second length. In one embodiment, the bins do not overlap each other.

The term “random sequencing” as used herein refers to sequencing whereby the nucleic acid fragments sequenced have not been specifically identified or predetermined before the sequencing procedure. Sequence-specific primers to target specific gene loci are not required. The term “universal sequencing” refers to sequencing where sequencing can start on any fragment. In one embodiment, adapters are added to the end of a fragment, and the primers for sequencing attach to the adapters. Thus, any fragment can be sequenced with the same primer, and thus the sequencing can be random.

The term “sequence tag” (also referred to as sequence read) as used herein refers to a string of nucleotides sequenced from any part or all of a nucleic acid molecule. For example, a sequence tag may be a short string of nucleotides (e.g., ~30) sequenced from a nucleic acid fragment, a short string of nucleotides at both ends of a nucleic acid fragment, or the sequencing of the entire nucleic acid fragment that exists in the biological sample. A nucleic acid fragment is any part of a larger nucleic acid molecule. A fragment (e.g., a gene) may exist separately from (i.e., not connected to) the other parts of the larger nucleic acid molecule.

The term “constitutional genome” (also referred to as CG) is composed of the consensus nucleotides at loci within the genome, and thus can be considered a consensus sequence. The CG can cover the entire genome of the subject (e.g., the human genome), or just parts of the genome. The constitutional genome (CG) can be obtained from DNA of cells as well as cell-free DNA (e.g., as can be found in plasma). Ideally, the consensus nucleotides should indicate that a locus is homozygous for one allele or heterozygous for two alleles. A heterozygous locus typically contains two alleles which are members of a genetic polymorphism. As an example, the criteria for determining whether a locus is heterozygous can be a threshold of two alleles each appearing in at least a predetermined percentage (e.g., 30% or 40%) of reads aligned to the locus. If one nucleotide appears at a sufficient percentage (e.g., 70% or greater) then the locus can be determined to be homozygous in the CG. Although the genome of one healthy cell can differ from the genome of another healthy cell due to random mutations spontaneously occurring during cell division, the CG should not vary when such a consensus is used. Some cells can have genomes with genomic rearrangements, e.g., B and T lymphocytes, such as involving antibody and T cell receptor genes. Cells with such large scale differences would still be a relatively small population of the total nucleated cell population in blood, and thus such rearrangements would not affect the determination of the constitutional genome with sufficient sampling (e.g., sequencing depth) of blood cells. Other cell types, including buccal cells, skin cells, hair follicles, or biopsies of various normal body tissues, can also serve as sources of CG.
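As an illustrative, non-limiting sketch, the consensus criteria described above could be expressed as follows. The function name `call_constitutional_genotype`, its parameter names, and the default cutoffs (30% for each allele of a heterozygous locus, 70% for a homozygous locus) are assumptions chosen for illustration from the example percentages given above. The handling of loci that meet neither criterion (returning `None`) is likewise an assumption.

```python
from collections import Counter


def call_constitutional_genotype(bases, het_min=0.30, hom_min=0.70):
    """Return the consensus allele(s) at one locus, or None if ambiguous.

    bases: the nucleotides observed at the locus, one per aligned read.
    het_min: hypothetical minimum read fraction for each of two alleles.
    hom_min: hypothetical minimum read fraction for a single allele.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return None  # no reads cover this locus

    ranked = counts.most_common(2)
    top_allele, top_count = ranked[0]

    # Homozygous: one nucleotide appears at a sufficient percentage.
    if top_count / total >= hom_min:
        return (top_allele,)

    # Heterozygous: two alleles each appear in at least het_min of reads.
    if len(ranked) == 2:
        second_allele, second_count = ranked[1]
        if top_count / total >= het_min and second_count / total >= het_min:
            return tuple(sorted((top_allele, second_allele)))

    return None  # neither criterion met; locus left uncalled


# A rare allele (2 of 20 reads) does not alter the consensus.
print(call_constitutional_genotype("A" * 18 + "G" * 2))
print(call_constitutional_genotype("A" * 11 + "G" * 9))
```

Because a low-abundance allele falls below both cutoffs, random mutations in a minority of healthy cells would not be expected to change the consensus under this sketch.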

The term “constitutional DNA” refers to any source of DNA that is reflective of the genetic makeup with which a subject is born. For a subject, examples of “constitutional samples”, from which constitutional DNA can be obtained, include healthy blood cell DNA, buccal cell DNA and hair root DNA. The DNA from these healthy cells defines the CG of the subject. The cells can be identified as healthy in a variety of ways, e.g., when a person is known to not have cancer or the sample can be obtained from tissue that is not likely to contain cancerous or premalignant cells (e.g., hair root DNA when liver cancer is suspected). As another example, a plasma sample may be obtained when a patient is cancer-free, and the determined constitutional DNA compared against results from a subsequent plasma sample (e.g., a year or more later). In another embodiment, a single biologic sample containing <50% of tumor DNA can be used for deducing the constitutional genome and the tumor-associated genetic alterations. In such a sample, the concentrations of tumor-associated single nucleotide mutations would be lower than those of each allele of heterozygous SNPs in the CG. Such a sample can be the same as the biological sample used to determine a sample genome, described below.

The term “biological sample” as used herein refers to any sample that is taken from a subject (e.g., a human, a person with cancer, a person suspected of having cancer, or other organisms) and contains one or more cell-free nucleic acid molecule(s) of interest. A biological sample can include cell-free DNA, some of which can have originated from healthy cells and some from tumor cells. For example, tumor DNA can be found in blood or other fluids, e.g., urine, pleural fluid, ascitic fluid, peritoneal fluid, saliva, tears or cerebrospinal fluid. A non-fluid example is a stool sample, which may be mixed with diarrheal fluid. For some of such samples, the biological sample can be obtained non-invasively. In some embodiments, the biological sample can be used as a constitutional sample.

The term “sample genome” (also referred to as SG) is a collection of sequence reads that have been aligned to locations of a genome (e.g., a human genome). The sample genome (SG) is not a consensus sequence, but includes nucleotides that may appear in only a sufficient number of reads (e.g., at least 2 or 3, or higher cutoff values). If an allele appears a sufficient number of times and it is not part of the CG (i.e., not part of the consensus sequence), then that allele can indicate a “single nucleotide mutation” (also referred to as an SNM). Other types of mutations can also be detected using the current invention, e.g., mutations involving two or more nucleotides (such as those affecting the number of tandem repeat units in a microsatellite or simple tandem repeat polymorphism), chromosomal translocation (which can be intrachromosomal or interchromosomal) and sequence inversion.

The term “reference genome” (also referred to as RG) refers to a haploid or diploid genome to which sequence reads from the biological sample and the constitutional sample can be aligned and compared. For a haploid genome, there is only one nucleotide at each locus. For a diploid genome, heterozygous loci can be identified, with such a locus having two alleles, where either allele can allow a match for alignment to the locus.

The term “level of cancer” can refer to whether cancer exists, a stage of a cancer, a size of tumor, and/or other measure of a severity of a cancer. The level of cancer could be a number or other characters. The level could be zero. The level of cancer also includes premalignant or precancerous conditions (states) associated with mutations or a number of mutations. The level of cancer can be used in various ways. For example, screening can check if cancer is present in someone who is not known previously to have cancer. Assessment can investigate someone who has been diagnosed with cancer. Detection can mean ‘screening’ or can mean checking if someone, with suggestive features of cancer (e.g., symptoms or other positive tests), has cancer.

DETAILED DESCRIPTION

Embodiments are provided for the detection of cancer by the analysis of a biological sample (e.g., a blood plasma/serum sample) that is not taken directly from a tumor and includes cell-free nucleic acids. The cell-free nucleic acids can originate from various types of tissue throughout the body. In this manner, a broad analysis for the detection of various cancers can be performed.

Genetic aberrations (including single nucleotide mutations, deletions, amplifications, and rearrangements) accumulate in the tumor cells during the development of cancers. In embodiments, massively parallel sequencing can be used to detect and quantify the single nucleotide mutations (SNMs), also called single nucleotide variations (SNVs), in body fluids (e.g., plasma, serum, saliva, ascitic fluid, pleural fluid and cerebrospinal fluid) so as to detect and monitor cancers. A quantification of the number of SNMs (or other types of mutations) can provide a mechanism for identifying early stages of cancer as part of screening tests. In various implementations, care is taken to distinguish sequencing errors and to distinguish spontaneous mutations occurring in healthy cells (e.g., by requiring multiple SNMs to be identified at a particular locus, e.g., at least 3, 4, or 5).

Some embodiments also provide noninvasive methods for the analysis of tumor heterogeneity, which can involve cells within the same tumor (i.e., intratumoral heterogeneity) or cells from different tumors (from either the same site or from different sites) within a body. For example, one can noninvasively analyze the clonal structure of such tumor heterogeneity, including an estimation of the relative tumor cell mass containing each mutation. Mutations that are present in higher relative concentrations are present in a larger number of malignant cells in the body, e.g., cells that have occurred earlier on during the tumorigenic process relative to other malignant cells still in the body (Welch J S et al. Cell 2012; 150: 264-278). Such mutations, due to their higher relative abundance, are expected to exhibit a higher diagnostic sensitivity for detecting cancer DNA than those with lower relative abundance. A serial monitoring of the change of the relative abundance of mutations would allow one to noninvasively monitor the change in the clonal architecture of tumors, either spontaneously as the disease progresses, or in response to treatment. Such information would be of use in assessing prognosis or in the early detection of tumor resistance to treatment.

I. Introduction

Mutations can occur during cell division because of errors in DNA replication and/or DNA repair. One type of such mutations involves the alteration of single nucleotides, which can involve multiple sequences from different parts of the genome. Cancers are generally believed to be due to the clonal expansion of a single cancer cell which has acquired a growth advantage. This clonal expansion would lead to the accumulation of mutations (e.g., single nucleotide mutations) in all the cancer cells originating from the ancestral cancer cell. These progeny tumor cells would share a set of mutations (e.g., single nucleotide mutations). As described herein, cancer-associated single nucleotide mutations are detectable in the plasma/serum of cancer patients.

Some embodiments can effectively screen for all mutations in a biological sample (e.g., the plasma or serum). As the number of mutations is not fixed (hundreds, thousands, or millions of cancer-associated mutations from different subpopulations of tumor cells can be detected), embodiments can provide a better sensitivity than techniques that detect specific mutations. The number of mutations can be used to detect cancer.

To provide such a screening of many or all mutations, embodiments can perform a search (e.g., a random search) for genetic variations in a biological sample (e.g., bodily fluids, including plasma and serum), which could contain tumor-derived DNA. The use of a sample, such as plasma, obviates the need to perform an invasive biopsy of the tumor or cancer. Also, as the screening can cover all or large regions of the genome, the screening is not limited to any enumerable and known mutations, but can use the existence of any mutation. Moreover, since the number of mutations is summed across all or large regions of the genome, a higher sensitivity can be obtained.

However, there are polymorphic sites, including single nucleotide polymorphisms (SNPs), in the human genome, which should not be counted in the mutations. Embodiments can ascertain whether genetic variations that have been detected are likely to be cancer-associated mutations or are polymorphisms in the genome. For example, as part of distinguishing between cancer-associated mutations and polymorphisms in the genome, embodiments can determine a constitutional genome, which can include polymorphisms. The polymorphisms of the constitutional genome (CG) can be confined to polymorphisms that are exhibited with a sufficiently high percentage (e.g., 30-40%) in the sequencing data.

The sequences obtained from the biological sample can then be aligned to the constitutional genome and variations that are single nucleotide mutations (SNMs), or other types of mutations, identified. These SNMs would be variations that are not included in the known polymorphisms, and thus can be labeled as cancer-associated, and not part of the constitutional genome. A healthy person may have a certain number of SNMs due to random mutations among healthy cells, e.g., created during cell division, but a person with cancer would have more.

For example, for a person with cancer, the number of SNMs detectable in a bodily fluid would be higher than the polymorphisms present in the constitutional genome of the same person. A comparison can be made between the amounts of variations detected in a bodily fluid sample containing tumor-derived DNA and a DNA sample containing mostly constitutional DNA. In one embodiment, the term ‘mostly’ would mean more than 90%. In another preferred embodiment, the term ‘mostly’ would mean more than 95%, 97%, 98%, or 99%. When the amount of variations in the bodily fluid exceeds that of the sample with mostly constitutional DNA, there is an increased likelihood that the bodily fluid might contain tumor-derived DNA.

One method that could be used to randomly search for variations in DNA samples is random or shotgun sequencing (e.g., using massively parallel sequencing). Any massively parallel sequencing platform may be used, including a sequencing-by-ligation platform (e.g., the Life Technologies SOLiD platform), the Ion Torrent/Ion Proton, semiconductor sequencing, Roche 454, and single molecule sequencing platforms (e.g., Helicos, Pacific Biosciences and nanopore). Yet, it is known that sequencing errors can occur and may be misinterpreted as a variation in the constitutional DNA or as mutations derived from tumor DNA. Thus, to improve the specificity of our proposed approach, the probability of the sequencing error or other components of analytical errors can be accounted for, e.g., by using an appropriate sequencing depth along with requiring at least a specified number (e.g., 2 or 3) of detected alleles at a locus for it to be counted as an SNM.

As described herein, embodiments can provide evidence for the presence of tumor-derived DNA in a biological sample (e.g., a bodily fluid) when the amount of randomly detected genetic variations present in the sample exceeds that expected for constitutional DNA and variations that may be inadvertently detected due to analytical errors (e.g., sequencing errors). The information could be used for the screening, diagnosis, prognostication and monitoring of cancers. In the following sections, we describe analytical steps that can be used for the detection of single nucleotide mutations in plasma/serum or other samples (e.g., bodily fluids). Bodily fluids could include plasma, serum, cerebrospinal fluid, pleural fluid, ascitic fluid, nipple discharge, saliva, bronchoalveolar lavage fluid, sputum, tears, sweat and urine. In addition to bodily fluids, the technology can also be applied to stool samples, as the latter have been shown to contain tumor DNA from colorectal cancer (Berger B M, Ahlquist D A. Pathology 2012; 44: 80-88).

II. General Screening Method

FIG. 1 is a flowchart of a method 100 for detecting cancer or premalignant change in a subject according to embodiments of the present invention. Embodiments can analyze cell-free DNA in a biological sample from the subject to detect variations in the cell-free DNA likely resulting from a tumor. The analysis can use a constitutional genome of the subject to account for polymorphisms that are part of healthy cells, and can account for sequencing errors. Method 100 and any of the methods described herein may be totally or partially performed with a computer system including one or more processors.

In step 110, a constitutional genome of the subject is obtained. The constitutional genome (CG) can be determined from the constitutional DNA of the tested subject. In various embodiments, the CG can be read from memory or actively determined, e.g., by analyzing sequence reads of constitutional DNA, which may be in cells from the sample that includes the cell-free DNA. For example, when a non-hematological malignancy is suspected, blood cells can be analyzed to determine the constitutional DNA of the subject.

In various implementations, the analysis of the constitutional DNA could be performed using massively parallel sequencing, array-based hybridization, probe-based in-solution hybridization, ligation-based assays, primer extension reaction assays, and mass spectrometry. In one embodiment, the CG can be determined at one time point in a subject's life, e.g., at birth or even in the prenatal period (which could be done using fetal cells or via cell-free DNA fragments, see U.S. Publication 2011/0105353), and then be referred to when bodily fluids or other samples are obtained at other times of the subject's life. Thus, the CG may simply be read from computer memory. The constitutional genome may be read out as a list of loci where the constitutional genome differs from a reference genome.

In step 120, one or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject, where the biological sample includes cell-free DNA. In one embodiment, the one or more sequence tags are generated from a random sequencing of DNA fragments in the biological sample. More than one sequence tag may be obtained when paired-end sequencing is performed. One tag would correspond to each end of the DNA fragment.

The cell-free DNA in the sample (e.g., plasma, serum or other body fluid) can be analyzed to search for genetic variations. The cell-free DNA can be analyzed using the same analytical platform as that used to analyze the constitutional DNA. Alternatively, a different analytical platform could be used. For example, the cell-free DNA sample can be sequenced using massively parallel sequencing or parts of the genome could be captured or enriched before massively parallel sequencing. If enrichment is used, one could, for example, use solution-phase or solid-phase capture of selected parts of the genome. Then, massively parallel sequencing can be carried out on the captured DNA.

In step 130, genomic positions for the sequence tags are determined. In one embodiment, the sequence tags are aligned to a reference genome, which is obtained from one or more other subjects. In another embodiment, the genomic sequence tags are aligned to the constitutional genome of the tested subject. The alignment can be performed using techniques known to one skilled in the art, e.g., using Basic Local Alignment Search Tool (BLAST).

In step 140, a first number of loci are determined where at least N sequence tags have a sequence variant relative to the constitutional genome (CG). N is equal to or greater than two. As discussed in more detail below, sequencing errors as well as somatic mutations occurring randomly in cells (e.g., due to cell division) can be removed by having N equal 2, 3, 4, 5, or higher. The loci that satisfy one or more specified criteria can be identified as a mutation (variant) or mutation loci (variant loci), whereas a locus having a variant but not satisfying the one or more criteria (e.g., having just one variant sequence tag) is referred to as a potential or putative mutation. The sequence variant could be for just one nucleotide or multiple nucleotides.

N may be determined as a percentage of total tags for a locus, as opposed to an absolute value. For example, a variant locus can be identified when the fractional concentration of tumor DNA inferred from the variant reads is determined to be equal to or greater than 10% (or some other percentage). In other words, when the locus is covered by 200 sequence reads, a criterion of at least 10 sequence reads showing the variant allele can be required to define the variant as a mutation. The 10 sequence reads of the variant allele and 190 reads of the wildtype allele would give a fractional concentration of tumor DNA of 10% (2×10/(10+190)).
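The percentage criterion above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the factor of two assumes, as in the worked example, that the variant sits on one of the two homologous chromosomes in the tumor cells.

```python
def tumor_fraction(variant_reads: int, total_reads: int) -> float:
    """Fractional tumor DNA concentration inferred from variant reads.

    Assumes a heterozygous mutation, so the tumor fraction is taken as
    twice the variant allele fraction: 2 x variant / total.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 2 * variant_reads / total_reads


def is_variant_locus(variant_reads: int, total_reads: int,
                     min_fraction: float = 0.10) -> bool:
    """True when the inferred tumor fraction meets the chosen cutoff."""
    return tumor_fraction(variant_reads, total_reads) >= min_fraction


# Worked example from the text: 10 variant reads and 190 wildtype reads.
print(tumor_fraction(10, 200))
print(is_variant_locus(10, 200))
print(is_variant_locus(9, 200))
```

The 10% default mirrors the example only; the text allows some other percentage.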

In one embodiment, the sequence tags (collectively referred to as the sample genome) can be compared directly to the CG to determine the variants. In another embodiment, the sample genome (SG) is compared to the CG via a reference genome (RG) to determine the variants. For example, both the CG and SG can be compared to the RG to determine respective numbers (e.g., sets) of loci exhibiting variants, and then a difference can be taken to obtain the first number of loci. The first number can simply be obtained as a number or may correspond to a specific set of loci, which may then be analyzed further to determine a parameter from the sequence tags at the first loci.

In one implementation, sequencing results of constitutional DNA and plasma DNA are compared to determine if a single nucleotide mutation is present in the plasma DNA. The regions at which the constitutional DNA is homozygous can be analyzed. For illustration purposes, assume the genotype of a particular locus is homozygous in the constitutional DNA and is AA. Then in the plasma, the presence of an allele other than A would indicate the potential presence of a single nucleotide mutation (SNM) at the particular locus. The loci indicating the potential presence of an SNM can form the first number of loci in step 140.

In one embodiment, it could be useful to target parts of the genome that are known to be particularly prone to mutation in a particular cancer type or in a particular subset of the population. Of relevance to the latter aspect, embodiments can look for types of mutations that are particularly prevalent in a specific population group, e.g. mutations that are especially common in subjects who are carriers of hepatitis B virus (for liver cancer) or human papillomavirus (for cervical cancer) or who have a genetic predisposition to somatic mutations or subjects with germline mutations in a DNA mismatch repair gene. The technology would also be useful to screen for mutations in ovarian and breast cancers in subjects with BRCA1 or BRCA2 mutations. The technology would similarly be useful to screen for mutations in colorectal cancer in subjects with APC mutations.

In step 150, a parameter is determined based on a count of sequence tags having a sequence variant at the first loci. In one example, the parameter is the first number of loci where at least N DNA fragments have a sequence variant at a locus relative to the constitutional genome. Thus, the count can be used simply to ensure that a locus has at least N copies of a particular variant identified before being included in the first number. In another embodiment, the parameter can be or include the total number of sequence tags having a sequence variant relative to the constitutional genome at the first loci.
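The two forms of the parameter described for step 150 can be illustrated with a short sketch. The locus labels, the input dictionary and the function name are hypothetical. The sketch assumes the per-locus counts of variant tags (relative to the CG) have already been obtained in steps 130 and 140.

```python
from typing import Dict, Tuple


def step_150_parameters(variant_counts: Dict[str, int],
                        n: int = 2) -> Tuple[int, int]:
    """Return (number of first loci, total variant tags at those loci).

    variant_counts maps a locus label to the number of sequence tags
    showing a variant relative to the constitutional genome there.
    A locus is kept only if it has at least n variant tags.
    """
    if n < 2:
        raise ValueError("N is equal to or greater than two")
    first_loci = {locus: c for locus, c in variant_counts.items() if c >= n}
    return len(first_loci), sum(first_loci.values())


# Hypothetical counts; the locus with a single variant tag is treated
# as a putative mutation and is excluded.
counts = {"chr1:1000": 1, "chr2:5000": 3, "chr7:42": 2, "chr9:77": 5}
print(step_150_parameters(counts, n=2))
print(step_150_parameters(counts, n=3))
```

Either element of the returned pair could serve as the parameter compared to a threshold in step 160.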

In step 160, the parameter for the subject is compared to a threshold value (e.g., derived from one or more other subjects) to determine a classification of a level of cancer in the subject. Examples of a level of cancer include whether the subject has cancer or a premalignant condition, or an increased likelihood of developing cancer. In one embodiment, the threshold value may be determined from a previously obtained sample from the subject.

In another embodiment, the one or more other subjects may be determined to not have cancer or to have a low risk of cancer. Thus, the threshold value may be a normal value, a normal range, or indicate a statistically significant deviation from a normal value or range. For example, the number of mutations relative to the CG of a specific subject, detectable in the plasma of subjects without a cancer or with a low risk of cancer, can be used as the normal range to determine if the number of mutations detected in the tested subject is normal. In another embodiment, the other subjects could be known to have cancer, and thus a similar number of mutations can indicate cancer.

In one implementation, the other subjects can be selected to have clinical characteristics that are matched to those of the test subject, e.g. sex, age, diet, smoking habit, drug history, prior disease, family history, genotypes of selected genomic loci, status for viral infections (e.g. hepatitis B or C virus or human papillomavirus or human immunodeficiency virus or Epstein-Barr virus infection) or infections with other infectious agents (such as bacteria (e.g. Helicobacter pylori) and parasites (e.g. Clonorchis sinensis)), etc. For example, subjects who are carriers of hepatitis B or C virus have an increased risk of developing hepatocellular carcinoma. Thus, test subjects who have a similar number or pattern of mutations as a carrier of hepatitis B or C can be considered to have an increased risk of developing hepatocellular carcinoma. On the other hand, a hepatitis B or C patient who exhibits more mutations than another hepatitis patient can properly be identified as having a higher classification of a level of cancer, since the proper baseline (i.e. relative to another hepatitis patient) is used. Similarly, subjects who are carriers of human papillomavirus infection have increased risk for cervical cancer, and head and neck cancer. Infection with the Epstein-Barr virus has been associated with nasopharyngeal carcinoma, gastric cancer, Hodgkin's lymphoma and non-Hodgkin's lymphoma. Infection with Helicobacter pylori has been associated with gastric cancer. Infection with Clonorchis sinensis has been associated with cholangiocarcinoma.

Changes in the number of mutations at different time points can be used to monitor the progress of the cancer and the treatment response. Such monitoring can also be used to document the progress of a premalignant condition or a change in the risk that a subject would develop cancer.

The amount of sequence tags showing variations can also be used for monitoring. For example, a fractional concentration of variant reads at a locus can be used. In one embodiment, an increase in the fractional concentrations of tumor-associated genetic aberrations in the samples during serial monitoring can signify the progression of the disease or imminent relapse. Similarly, a decrease in the fractional concentrations of tumor-associated genetic aberrations in the samples during serial monitoring can signify response to treatment and/or remission and/or good prognosis.

III. Determining Genomes

The various genomes discussed above are explained in more detail below. For example, the reference genome, constitutional genome, and the sample genome are discussed.

A. Reference Genome

The reference genome (RG) refers to a haploid or diploid genome of a subject or consensus of a population. The reference genome is known and thus can be used to compare sequencing reads from new patients. The sequence reads from a sample of a patient can be aligned and compared to identify variations in the reads from the RG. For a haploid genome, there is only one nucleotide at each locus, and thus each locus can be considered hemizygous. For a diploid genome, heterozygous loci can be identified, with such a locus having two alleles, where either allele can allow a match for alignment to the locus.

A reference genome can be the same among a population of subjects. This same reference genome can be used for healthy subjects to determine the appropriate threshold to be used for classifying the patient (e.g., having cancer or not). However, different reference genomes can be used for different populations, e.g., for different ethnicities or even for different families.

B. Constitutional Genome

The constitutional genome (CG) for a subject (e.g., a human or other diploid organism) refers to a diploid genome of the subject. The CG can specify heterozygous loci where a first allele is from a first haplotype and a different second allele is from a second haplotype. Note that the structures of two haplotypes that cover two heterozygous loci need not be known, i.e., which allele on one heterozygous locus is on the same haplotype as an allele of another heterozygous locus need not be known. Just the existence of the two alleles at each heterozygous locus can be sufficient.

The CG can differ from the RG due to polymorphisms. For example, a locus on the RG can be homozygous for T, but the CG is heterozygous for T/A. Thus, the CG would exhibit a variation at this locus. The CG can also be different from the RG due to inherited mutations (e.g., that run in families) or de novo mutations (that occur in a fetus, but which are not present in its parents). An inherited mutation is typically called a ‘germline mutation’. Some such mutations are associated with predisposition to cancer, such as a BRCA1 mutation that runs in a family. Such mutations are different from ‘somatic mutations’ that can occur due to cell division during one's lifetime and can push a cell and its progeny on the way to becoming a cancer.

A goal of determining the CG is to remove such germline mutations and de novo mutations from the mutations of the sample genome (SG) in order to identify the somatic mutations. The amount of somatic mutations in the SG can then be used to assess the likelihood of cancer in the subject. These somatic mutations can be further filtered to remove sequencing errors, and potentially to remove somatic mutations that occur rarely (e.g., only one read showing a variant), as such somatic mutations are not likely related to cancer.

In one embodiment, a CG can be determined using cells (e.g., buffy coat DNA). However, the CG can also be determined from cell-free DNA (e.g. plasma or serum). For a sample type in which most of the cells are non-malignant, e.g. the buffy coat from a healthy subject, the majority or consensus genome is the CG. For the CG, each genomic locus consists of the DNA sequence possessed by the majority of cells in the sampled tissue. The sequencing depth should be sufficient to elucidate heterozygous sites within the constitutional genome.

As another example, plasma can be used as the constitutional sample to determine the CG. For example, for cases in which the tumor DNA in plasma is less than 50% and an SNM is in a heterozygous state, e.g., the mutation is the addition of a new allele, the new allele can have a concentration of less than 25%. In contrast, the concentration of the heterozygous alleles of SNPs in the CG should amount to approximately 50%. Thus, a distinction can be made between a somatic mutation and a polymorphism of the CG. In one implementation, a suitable cutoff can be between 30% and 40% for distinguishing a somatic mutation from a polymorphism when using plasma, or other mixtures with significant tumor concentration. A measurement of tumor DNA concentration can be useful to ensure that the tumor DNA in plasma is less than 50%. Examples of determining a tumor DNA concentration are described herein.
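A minimal sketch of this cutoff is given below, assuming that tumor DNA in plasma is below 50% so that a heterozygous SNM stays under 25%. The 35% default is an arbitrary pick from within the 30-40% range mentioned above. The function name and labels are hypothetical.

```python
def classify_plasma_allele(allele_reads: int, total_reads: int,
                           cutoff: float = 0.35) -> str:
    """Label a second allele seen in plasma by its allele fraction.

    Assumes tumor DNA is below 50% of plasma DNA, so a heterozygous
    somatic mutation is expected below 25%, while a heterozygous SNP
    of the CG is expected near 50%.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    fraction = allele_reads / total_reads
    if fraction >= cutoff:
        return "constitutional polymorphism"
    return "candidate somatic mutation"


print(classify_plasma_allele(48, 100))  # near 50%
print(classify_plasma_allele(12, 100))  # well below 25%
```

The sketch does not filter sequencing errors; in practice a minimum read count would be applied as well, as discussed for the cutoff value N.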

C. Sample Genome

The sample genome (SG) is not simply a haploid or diploid genome as is the case for the RG and CG. The SG is a collection of reads from the sample, and can include: reads from constitutional DNA that correspond to the CG, reads from tumor DNA, reads from healthy cells that show random mutations relative to the CG (e.g., due to mutations resulting from cell division), and sequencing errors. Various parameters can be used to control exactly which reads are included in the SG. For example, requiring an allele to show up in at least 5 reads can decrease the sequencing errors present in the SG, as well as decrease the reads due to random mutations.

As an example, assume the subject is healthy, i.e., does not have cancer. For illustration purposes, the DNA from 1000 cells is in 1 ml of plasma (i.e. 1000 genome-equivalents of DNA) obtained from this subject. Plasma DNA typically consists of DNA fragments of about 150 bp. As the human genome is 3×10⁹ bp, there would be about 2×10⁷ DNA fragments per haploid genome. As the human genome is diploid, there would be about 4×10⁷ DNA fragments per genome-equivalent (i.e. per cell's worth of DNA).

As millions to billions of cells are releasing their DNA into the plasma per unit time and fragments from these cells would mix together during circulation, the 4×10⁷ DNA fragments could have come from 4×10⁷ different cells. If these cells do not bear a recent (as opposed to distant, e.g., the original zygote) clonal relationship to each other (i.e., they do not share a recent ancestral cell), then it is statistically likely that no mutation will be seen more than once amongst these fragments.

On the other hand, if amongst the 1000 genome-equivalents per ml of plasma DNA, there is a certain percentage of cells that share a recent ancestral cell (i.e., they are related to each other clonally), then one could see the mutations from this clone to be preferentially represented in the plasma DNA (e.g. exhibiting a clonal mutational profile in plasma). Such clonally related cells could be cancer cells, or cells that are on their way to become a cancer but not yet there (i.e. pre-neoplastic). Thus, requiring a mutation to show up more than once can remove this natural variance in the “mutations” identified in the sample, which can leave more mutations related to cancer cells or pre-neoplastic cells, thereby allowing detection, especially early detection, of cancer or precancerous conditions.

In one approximation, it has been stated that on average, one mutation will be accumulated in the genome following every cell division. Previous work has shown that most of the plasma DNA is from hematopoietic cells (Lui Y Y et al. Clin Chem 2002; 48: 421-427). It has been estimated that hematopoietic stem cells replicate once every 25-50 weeks (Catlin S N, et al. Blood 2011; 117: 4460-4466). Thus, as a simplistic approximation, a healthy 40-year-old subject would have accumulated some 40 to 80 mutations per hematopoietic stem cell.

If there are 1000 genome-equivalents per ml in this person's plasma, and if each of these cells is derived from a different hematopoietic stem cell, then 40,000 to 80,000 mutations might be expected amongst the 4×10¹⁰ DNA fragments (i.e. 4×10⁷ DNA fragments per genome, and 1000 genome-equivalents per ml of plasma). However, as each mutation would be seen once, each mutation can still be below a detection limit (e.g., if cutoff value N is greater than 1), and thus these mutations can be filtered out, thereby allowing the analysis to focus on mutations that are more likely to result from cancerous conditions. The cutoff value can be any value (integer or non-integer) greater than one, and may be dynamic for different loci and regions. The sequencing depth and fractional concentration of tumor DNA can also affect the sensitivity of detecting mutations (e.g., percentage of mutations detectable) from cancer cells or pre-neoplastic cells.
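The back-of-envelope figures in this example can be reproduced with the following sketch. The constant and function names are hypothetical, and the one-mutation-per-division rate and 52 weeks per year are the simplifying assumptions used for illustration.

```python
GENOME_BP = 3e9                   # haploid human genome size used in the text
FRAGMENT_BP = 150                 # typical plasma DNA fragment length
GENOME_EQUIVALENTS_PER_ML = 1000  # illustrative value from the text

fragments_per_haploid = GENOME_BP / FRAGMENT_BP        # about 2e7
fragments_per_diploid = 2 * fragments_per_haploid      # about 4e7
fragments_per_ml = fragments_per_diploid * GENOME_EQUIVALENTS_PER_ML  # about 4e10


def mutations_per_stem_cell(age_years: float, weeks_per_division: float) -> float:
    """Mutations accumulated, assuming one mutation per cell division."""
    return age_years * 52 / weeks_per_division


low = mutations_per_stem_cell(40, 50)   # one division every 50 weeks
high = mutations_per_stem_cell(40, 25)  # one division every 25 weeks

print(f"{fragments_per_ml:.1e} fragments per ml")
print(f"{low:.0f} to {high:.0f} mutations per stem cell")
print(f"{low * GENOME_EQUIVALENTS_PER_ML:,.0f} to "
      f"{high * GENOME_EQUIVALENTS_PER_ML:,.0f} mutations per ml")
```

The computed range is slightly above the rounded "40 to 80" and "40,000 to 80,000" quoted in the text, which is consistent with the text's description of this as a simplistic approximation.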

IV. Comparing SG Directly to CG

Some embodiments can identify nucleotide positions at which the CG is homozygous, but where a minority species (i.e. the tumor DNA) in the SG is heterozygous. When sequencing a position at high depth (e.g., over 50-fold coverage), one can detect whether there are one or two alleles at that position in the DNA mixture of healthy and cancer cells. When two alleles are detected, either (1) the CG is heterozygous or (2) the CG is homozygous but the SG is heterozygous. These two scenarios can be differentiated by looking at the relative counts of the major and the minor alleles. In the former scenario, the two alleles would have similar numbers of counts; but for the latter scenario, there would be a large difference in their numbers of counts. This comparison of the relative allele counts of the reads from the test sample is one embodiment for comparing sequence tags to the constitutional genome. The first loci of method 100 can be determined as loci where the count of the minor allele is below an upper threshold (threshold corresponding to a polymorphism in the CG) and above a lower threshold (threshold corresponding to errors and somatic mutations occurring at a sufficiently low rate to not be associated with a cancerous condition). Thus, the constitutional genome and the first loci can be determined at the same time.

In another embodiment, a process for identifying mutations can determine the CG first, and then determine loci having a sufficient number of mutations relative to the CG. The CG can be determined from a constitutional sample that is different from the test sample.

FIG. 2 shows a flowchart of a method 200 comparing the sample genome (SG) directly to the constitutional genome (CG) according to embodiments of the present invention. At block 210, a constitutional genome of the subject is obtained. The constitutional genome can be obtained, for example, from a sample taken previously in time or a constitutional sample that is obtained and analyzed just before method 200 is implemented.

At block 220, one or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject. The sequencing may be performed using various techniques, as mentioned herein. The sequence tags are a measurement of what the sequence of a fragment is believed to be. However, one or more bases of a sequence tag may be in error.

At block 230, at least a portion of the sequence tags are aligned to the constitutional genome. The alignment can account for the CG being heterozygous at various loci. The alignment would not require an exact match so that variants could be detected.

At block 240, sequence tags that have a sequence variant at a locus relative to the constitutional genome are identified. It is possible that a sequence tag could have more than one variant. The variants for each locus and for each sequence tag can be tracked. A variant could be any allele that is not in the CG. For example, the CG could be heterozygous for A/T and the variant could be G or C.

At block 250, for each locus with a variant, a computer system can count a respective first number of sequence tags that align to the locus and have a sequence variant at the locus. Thus, each locus can have an associated count of the number of variants seen at the locus. Typically, fewer variants will be seen at a locus compared to sequence tags that correspond to the CG, e.g., due to the tumor DNA concentration being less than 50%. However, some samples may have a concentration of tumor DNA that is greater than 50%.

At block 260, a parameter is determined based on the respective first numbers. In one embodiment, if a respective number is greater than a cutoff value (e.g., greater than two), then the respective number can be added to a sum, which is the parameter or is used to determine the parameter. In another embodiment, the number of loci having a respective number greater than the cutoff value is used as the parameter.
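Blocks 240 to 260 can be sketched as follows, assuming alignment (block 230) has already reduced each sequence tag to (locus, base) observations. The data layout and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set, Tuple


def count_variant_tags(observations: Iterable[Tuple[Hashable, str]],
                       cg: Dict[Hashable, Set[str]]) -> Counter:
    """Blocks 240-250: per-locus count of tags showing a non-CG allele.

    cg maps a locus to its constitutional alleles, e.g. {"A"} for a
    homozygous locus or {"A", "T"} for a heterozygous one.
    """
    counts: Counter = Counter()
    for locus, base in observations:
        if locus in cg and base not in cg[locus]:
            counts[locus] += 1
    return counts


def parameter_as_sum(counts: Counter, cutoff: int = 2) -> int:
    """Block 260, first embodiment: sum the counts greater than cutoff."""
    return sum(c for c in counts.values() if c > cutoff)


def parameter_as_loci(counts: Counter, cutoff: int = 2) -> int:
    """Block 260, second embodiment: number of loci above cutoff."""
    return sum(1 for c in counts.values() if c > cutoff)


# Hypothetical data: locus 2 is heterozygous A/T in the CG, so only
# G or C would count as a variant there.
cg = {1: {"A"}, 2: {"A", "T"}, 3: {"C"}}
obs = [(1, "A"), (1, "A"), (1, "G"), (1, "G"), (1, "G"),
       (2, "A"), (2, "T"), (2, "G"),
       (3, "C"), (3, "T"), (3, "T"), (3, "T"), (3, "T")]
counts = count_variant_tags(obs, cg)
print(dict(counts))
print(parameter_as_sum(counts), parameter_as_loci(counts))
```

Locus 2 contributes a single variant tag and therefore falls below the cutoff in both embodiments.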

At block 270, the parameter is compared to a threshold value to classify a level of cancer. As described above, the threshold value may be determined from the analysis of samples from other subjects. Depending on the healthy or cancer state of these other subjects, the classification can be determined. For example, if the other subjects had stage 4 cancer, then if the current parameter was close (e.g., within a specific range) to the value of the parameter obtained from the other subjects, then the current subject might be classified as having stage 4 cancer. However, if the parameter exceeds the threshold (i.e., is greater than or less than the threshold, depending on how the parameter is defined), then the classification can be identified as being less than stage 4. A similar analysis can be made when the other subjects do not have cancer.

Multiple thresholds may be used to determine the classification, where each threshold is determined from a different set of subjects. Each set of subjects may have a common level of cancer. Thus, the current parameter may be compared to the values for each set of subjects, which can provide a match to one of the sets or provide a range. For example, the parameter might be about equal to the parameter obtained for subjects that are precancerous or at stage 2. As another example, the current parameter can fall in a range that can possibly match to several different levels of cancer. Thus, the classification can include more than one level of cancer.

V. Using Reference Genome

The genomic sequences of both the constitutional DNA and the DNA from the biological sample can be compared to the human reference genome. When there are more changes in the plasma sample than the constitutional DNA as compared with the reference genome, then there is a higher probability for cancer. In one embodiment, the homozygous loci in the reference genome are studied. The amounts of heterozygous loci in both the constitutional DNA and DNA from the biological sample are compared. When the amount of heterozygous sites detected from the DNA of the biological sample exceeds that of the constitutional DNA, there is a higher probability of cancer.

The analysis could also be limited to loci that are homozygous in the CG. SNMs can be defined for heterozygous loci as well, but this would generally require the generation of a third variant. In other words, if the heterozygous locus is A/T, a new variant would be either C or G. Identifying SNMs for homozygous loci is generally easier.

The degree of increase in the amount of heterozygous loci in the biological sample DNA relative to the constitutional DNA can be suggestive of cancer or a premalignant state when compared to the rate of change seen in healthy subjects. For example, if the degree of increase in such sites exceeds that observed in healthy subjects by a certain threshold, one can consider the data to be suggestive of cancer or a premalignant state. In one embodiment, the distribution of mutations in subjects without cancer is ascertained and a threshold can be taken as a certain number of standard deviations (e.g., 2 or 3 standard deviations) from the mean of that distribution.
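A threshold based on standard deviations could be computed as sketched below. The control values are invented purely for illustration, the function names are hypothetical, and the use of the sample standard deviation about the mean is an assumption rather than something the text specifies.

```python
from statistics import mean, stdev
from typing import Sequence


def threshold_from_controls(control_values: Sequence[float],
                            n_sd: float = 3.0) -> float:
    """Mean of the healthy-subject values plus n_sd standard deviations."""
    return mean(control_values) + n_sd * stdev(control_values)


def is_suggestive(value: float, control_values: Sequence[float],
                  n_sd: float = 3.0) -> bool:
    """True when the test subject's value exceeds the control threshold."""
    return value > threshold_from_controls(control_values, n_sd)


# Hypothetical mutation counts from subjects without cancer.
controls = [10, 12, 11, 9, 13]
print(round(threshold_from_controls(controls, n_sd=3), 2))
print(is_suggestive(20, controls), is_suggestive(14, controls))
```

With only a handful of controls the estimate of the standard deviation is noisy; a larger matched control group would normally be used.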

One embodiment can require at least a specified number of variants at a locus before that locus is counted. Another embodiment provides a test even for the data based on seeing a change once. For example, when the total number of variations (errors + genuine mutations or polymorphisms) seen in plasma is statistically significantly higher than that in the constitutional DNA, then there is evidence for cancer.

FIG. 3 shows a flowchart of a method 300 comparing the sample genome (SG) to the constitutional genome (CG) using the reference genome (RG) according to embodiments of the present invention. Method 300 assumes that the RG is already obtained, and that the sequence tags for the biological sample have already been received.

At block 310, at least a portion of the sequence tags are aligned to the reference genome. The alignment can allow mismatches as variations are being detected. The reference genome can be from a similar population as the subject. The aligned sequence tags effectively comprise the sample genome (SG).

At block 320, a first number (A) of potential variants, e.g., single nucleotide mutations (SNMs), are identified. The potential SNMs are loci where a sequence tag of the SG shows a nucleotide that is different from the RG. Other criteria may be used, e.g., the number of sequence tags showing a variation must be greater than a cutoff value and whether a locus is homozygous in the RG. The set of potential SNMs may be represented as set A when specific loci are identified and tracked by storing the loci in memory. The specific loci may be determined or simply a number of such SNMs can be determined.

At block 330, a constitutional genome is determined by aligning sequence tags obtained by sequencing DNA fragments from a constitutional sample to a reference genome. This step could have been performed at any time previously and using a constitutional sample obtained at any time previously. The CG could simply be read from memory, where the aligning was previously done. In one embodiment, the constitutional sample could be blood cells.

At block 340, a second number (B) of loci where an aligned sequence tag of the CG has a variant (e.g., an SNM) at a locus relative to the reference genome are identified. If a set of loci is specifically tracked, then B can represent the set, as opposed to just a number.

At block 350, set B is subtracted from set A to identify variants (SNMs) that are present in the sample genome but not in the CG. In one embodiment, the set of SNMs can be limited to nucleotide positions at which the CG is homozygous. To achieve this filtering, specific loci where the CG is homozygous can be identified in set C. In another embodiment, a locus is not counted in the first number A or the second number B, if the CG is not homozygous at the locus. In another embodiment, any known polymorphism (e.g. by virtue of its presence in a SNP database) can be filtered out.

In one embodiment, the subtraction in block 350 can simply be a subtraction of numbers, and thus specific potential SNMs are not removed, but simply a value is subtracted. In another embodiment, the subtraction takes a difference between set A and set B (e.g., where set B is a subset of set A) to identify the specific SNMs that are not in set B. In logical values, this can be expressed as [A AND NOT(B)]. The resulting set of identified variants can be labeled C. The parameter can be determined as the number C or determined from the set C.
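Both forms of the subtraction can be illustrated with a short sketch. Representing each variant as a (chromosome, position, allele) tuple is an assumption made for illustration, as are the function name and the example loci.

```python
from typing import FrozenSet, Set, Tuple

Variant = Tuple[str, int, str]  # (chromosome, position, variant allele)


def subtract_sets(set_a: Set[Variant], set_b: Set[Variant],
                  known_polymorphisms: FrozenSet[Variant] = frozenset()
                  ) -> Set[Variant]:
    """Set C = A AND NOT(B), optionally dropping known polymorphisms."""
    return set_a - set_b - known_polymorphisms


# Hypothetical variants relative to the reference genome.
set_a = {("chr1", 100, "G"), ("chr2", 200, "T"), ("chr3", 300, "C")}  # sample genome
set_b = {("chr2", 200, "T")}                                         # constitutional genome

set_c = subtract_sets(set_a, set_b)
print(sorted(set_c))

# Number-only embodiment: a value is subtracted, no specific loci removed.
number_c = len(set_a) - len(set_b)
print(number_c)
```

The set form retains which loci survive, which is needed if the loci are later weighted or analyzed individually; the number form only yields a count.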

In some embodiments, the nature of the mutations can be taken into consideration and different weighting attributed to different classes of mutations. For example, mutations that are commonly associated with cancer can be attributed a higher weighting (also called an importance value when referring to relative weightings of loci). Such mutations can be found in databases of tumor-associated mutations, e.g., the Catalogue of Somatic Mutations in Cancer (COSMIC) (www.sanger.ac.uk/genetics/CGP/cosmic/). As another example, mutations associated with non-synonymous changes can be attributed a higher weighting.

Thus, the first number A could be determined as a weighted sum, where the count of tags showing a variant at one locus may have a different weighting than the count of tags at another locus. The first number A can reflect this weighted sum. A similar calculation can be performed for B, and thus the number C and the parameter can reflect this weighting. In another embodiment, the weightings are accounted for when a set C of specific loci is determined. For example, a weighted sum can be determined for the counts for the loci of set C. Such weights can be used for other methods described herein.
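A weighted sum over per-locus counts might look like the following sketch. The weights, locus labels and function name are hypothetical. In practice the weights could be derived from a resource such as COSMIC or from whether a change is non-synonymous, as described above.

```python
from typing import Dict


def weighted_parameter(variant_counts: Dict[str, int],
                       weights: Dict[str, float],
                       default_weight: float = 1.0) -> float:
    """Sum of per-locus variant tag counts times an importance value.

    Loci without an explicit weight fall back to default_weight.
    """
    return sum(count * weights.get(locus, default_weight)
               for locus, count in variant_counts.items())


# Hypothetical: locus_1 is a commonly cancer-associated mutation and is
# given twice the weight of an unannotated locus.
counts = {"locus_1": 4, "locus_2": 3}
weights = {"locus_1": 2.0}
print(weighted_parameter(counts, weights))
```

The same function could be applied to the loci of set A, set B or set C, depending on where the weighting is introduced.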

Accordingly, the parameter that is compared to a threshold to determine the classification of a level of cancer can be the number of loci exhibiting a variation for the SG and the CG relative to the RG. In other embodiments, the total number of DNA fragments (as counted via the sequence tags) showing a variation can be counted. In other embodiments, such numbers can be used in another formula to obtain the parameter.

In one embodiment, the concentration of the variant at each locus can be a parameter and compared with a threshold. This threshold can be used to determine if a locus is a potential variant locus (in addition to the cutoff of a specific number of reads showing the variant), and then have the locus be counted. The concentration could also be used as a weighting factor in a sum of the SNMs.

VI. Decreasing False Positives Using Cutoff Values

As mentioned above, the single nucleotide mutations can be surveyed in a large number of cell-free DNA fragments (e.g. circulating DNA in plasma) for a large genomic region (e.g. the entire genome) or a number of genomic regions to improve the sensitivity of the approach. However, analytical errors such as sequencing errors can affect the feasibility, accuracy and specificity of this approach. Here, we use the massively parallel sequencing platform as an example to illustrate the importance of sequencing errors. The sequencing error rate of the Illumina sequencing-by-synthesis platform is approximately 0.1% to 0.3% per sequenced nucleotide (Minoche et al. Genome Biol 2011, 12:R112). Any massively parallel sequencing platform may be used, including a sequencing-by-ligation platform (e.g. the Life Technologies SOLiD platform), the Ion Torrent/Ion Proton, semiconductor sequencing, Roche 454, and single molecule sequencing platforms (e.g. Helicos, Pacific Biosciences and nanopore).

In a previous study on hepatocellular carcinoma, it was shown that there are approximately 3,000 single nucleotide mutations for the whole cancer genome (Tao Y et al. Proc Natl Acad Sci USA 2011; 108: 12042-12047). Assuming that only 10% of the total DNA in the circulation is derived from the tumor cells and we sequence the plasma DNA with an average sequencing depth of one-fold haploid genome coverage, we would encounter 9 million (3×10⁹×0.3%) single nucleotide variations (SNVs) due to sequencing errors. However, most of the single nucleotide mutations are expected to occur on only one of the two homologous chromosomes. With a sequencing depth of one-fold haploid genome coverage of a sample with 100% tumor DNA, we expect to detect only half of the 3,000 mutations, i.e. 1,500 mutations. When we sequence the plasma sample containing 10% tumor-derived DNA to one haploid genome coverage, we expect to detect only 150 (1,500×10%) cancer-associated single nucleotide mutations. Thus, the signal-to-noise ratio for the detection of cancer-associated mutations is 1 in 60,000. This very low signal-to-noise ratio suggests that the accuracy of using this approach for differentiating normal and cancer cases would be very low if we simply use all the single nucleotide changes in the biological sample (e.g., plasma) as a parameter.
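The arithmetic behind the 1 in 60,000 figure can be laid out as a short sketch. The function and variable names are hypothetical; the inputs are the values quoted in the paragraph above.

```python
def signal_to_noise_at_onefold(genome_bp: float, error_rate: float,
                               total_mutations: int, tumor_fraction: float):
    """Return (error SNVs, detected mutations, noise per unit of signal)
    for one-fold haploid genome coverage."""
    # Every sequenced nucleotide can carry an error.
    error_snvs = genome_bp * error_rate
    # Mutations sit on one of two homologous chromosomes, so one-fold
    # haploid coverage of pure tumor DNA samples about half of them.
    detectable_pure_tumor = total_mutations / 2
    # Only the tumor-derived fraction of plasma DNA carries them.
    detected_in_plasma = detectable_pure_tumor * tumor_fraction
    return error_snvs, detected_in_plasma, error_snvs / detected_in_plasma


errors, signal, ratio = signal_to_noise_at_onefold(
    genome_bp=3e9, error_rate=0.003, total_mutations=3000, tumor_fraction=0.10)
print(f"{errors:,.0f} error SNVs, {signal:,.0f} true mutations, "
      f"1 in {ratio:,.0f}")
```

Lowering the assumed error rate to 0.1% would improve the ratio only threefold, which is why a filter on repeated observations is considered next.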

It is expected that with the progress in sequencing technologies, there would be continual reduction in the sequencing error rate. One can also analyze the same sample using more than one sequencing platform and through a comparison of the cross-platform sequencing results, pinpoint the reads likely to be affected by sequencing errors. Another approach is to analyze two samples taken at different times from the same subject. However, such approaches are time consuming.

In one embodiment, one way to enhance the signal-to-noise ratio in the detection of single nucleotide mutations in the plasma of cancer patients is to count a mutation only if there are multiple occurrences of the same mutation in the sample. In selected sequencing platforms, the sequencing errors involving particular nucleotide substitutions might be more common and would affect the sequencing results of the test sample and the constitutional DNA sample of both the test subject and the control subjects. However, in general, sequencing errors occur randomly.

The chance of having a sequencing error is exponentially lower when one observes the same change at the same nucleotide position in multiple DNA fragments. On the other hand, the chance of detecting a genuine cancer-associated mutational change in the sample is affected by the sequencing depth and the fractional concentration of the tumoral DNA in the sample. The chance of observing the mutation in multiple DNA fragments would increase with the sequencing depth and fractional concentration of tumoral DNA. In various embodiments using samples with cell-free tumoral DNA (such as in plasma), the fractional concentration can be 5%, 10%, 20%, and 30%. In one embodiment, the fractional concentration is less than 50%.

FIG. 4 is a table 400 showing the number of cancer-associated single nucleotide mutations correctly identified using different numbers of occurrences as the criterion for classifying a mutation as being present in the sample according to embodiments of the present invention. The numbers of nucleotide positions that are falsely identified as having a mutation because of sequencing error based on the same classification criteria are also shown. The sequencing error rate is assumed to be 0.1% (Minoche et al. Genome Biol 2011, 12:R112). The fractional concentration of tumor-derived DNA in the sample is assumed to be 10%.

FIG. 4 shows that the ratio between the number of cancer-associated mutations detected in the plasma and the number of false-positive calls would increase exponentially with the increasing number of times the same change is seen in the sample for defining a mutation, when the fractional concentration of tumor-derived DNA in the sample is assumed to be 10%. In other words, both the sensitivity and specificity for cancer mutation detection would improve. In addition, the sensitivity for detecting the cancer-associated mutations is affected by the sequencing depth. With 100-fold haploid genome coverage of sequencing, 2,205 (73.5%) of the 3,000 mutations can be detected even using the criterion of the occurrence of the particular mutation in at least 4 DNA fragments in the sample. Other values for the minimum number of fragments may be used, such as 3, 5, 8, 10, and greater than 10.
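As a check on the 73.5% figure, the following worked calculation assumes the Poisson model used elsewhere in this document, with the mean number of mutant fragments per locus taken as depth × fractional concentration / 2. The symbol M is used here only for illustration.

$M = \frac{100 \times 0.10}{2} = 5, \qquad P(\geq 4) = 1 - e^{-5}\left( 1 + 5 + \frac{5^{2}}{2!} + \frac{5^{3}}{3!} \right) \approx 1 - 0.265 = 0.735, \qquad 3{,}000 \times 0.735 \approx 2{,}205$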

FIG. 5 is a table 500 showing the expected number of false-positive loci and the expected number of mutations identified when the fractional concentration of tumor-derived DNA in the sample is assumed to be 5%. With a lower fractional concentration of tumor-derived DNA in the sample, a higher sequencing depth would be required to achieve the same sensitivity of detecting the cancer-associated mutations. A more stringent criterion would also be required to maintain the specificity. For example, the criterion of the occurrence of the particular mutation in at least 5 DNA fragments in the sample would need to be used, instead of the criterion of at least 4 occurrences used in the situation of a 10% tumor DNA fraction. Tables 400 and 500 provide guidance for the cutoff value to use given the fold coverage and a tumor DNA concentration, which can be assumed or measured as described herein.

Another advantage of using the criterion of detecting a single nucleotide change more than once to define a mutation is that it is expected to minimize false-positive detections caused by single nucleotide changes in non-malignant tissues. As nucleotide changes can occur during mitosis of normal cells, each healthy cell in the body can harbor a number of single nucleotide changes. These changes may potentially lead to false-positive results. However, the changes of a cell would be present in the plasma/serum only when the cell dies. While different normal cells are expected to carry different sets of mutations, the mutations occurring in one cell are unlikely to be present in numerous copies in the plasma/serum. This is in contrast to mutations within tumor cells, where multiple copies are expected to be seen in plasma/serum because tumor growth is clonal in nature. Thus, multiple cells from a clone would die and release the signature mutations representative of the clone.

In one embodiment, target enrichment for specific genomic regions can be performed before sequencing. This target enrichment step can increase the sequencing depth of the regions of interest with the same total amount of sequencing performed. In yet another embodiment, a round of sequencing with relatively low sequencing depth can first be performed. Then regions showing at least one single nucleotide change can be enriched for a second round of sequencing which has higher fold coverage. Then, the criterion of multiple occurrences can be applied to define a mutation for the sequencing results with target enrichment.

VII. Dynamic Cutoffs

As described above, a cutoff value N for the number of reads supporting a variant (potential mutation) can be used to determine whether a locus qualifies as a mutation (e.g., an SNM) to be counted. Using such a cutoff can reduce false positives. The discussion below provides methods for selecting a cutoff for different loci. In the following embodiments, we assume that there is a single predominant cancer clone. Similar analysis can be carried out for scenarios involving multiple clones of cancer cells releasing different amounts of tumor DNA into the plasma.

A. Number of Cancer-Associated Mutations Detected in Plasma

The number of cancer-associated mutations detectable in plasma can be affected by a number of parameters, for example: (1) The number of mutations in the tumor tissue (N_(T)): the total number of mutations present in the tumor tissue is the maximum number of tumor-associated mutations detectable in the plasma of the patient; (2) The fractional concentration of tumor-derived DNA in the plasma (f): the higher the fractional concentration of tumor-derived DNA in the plasma, the higher the chance of detecting the tumor-associated mutations in the plasma would be; (3) Sequencing depth (D): sequencing depth refers to the number of times the sequenced region is covered by the sequence reads. For example, an average sequencing depth of 10-fold means that each nucleotide within the sequenced region is covered on average by 10 sequence reads. The chance of detecting a cancer-associated mutation would increase when the sequencing depth is increased; and (4) The minimum number of times that a nucleotide change must be detected in the plasma so as to define it as a potential cancer-associated mutation (r), which is a cutoff value used to discriminate sequencing errors from real cancer-associated mutations.

In one implementation, the Poisson distribution is used to predict the number of cancer-associated mutations detected in plasma. Assuming that a mutation is present in a nucleotide position on one of the two homologous chromosomes, with a sequencing depth of D, the expected number of times a mutation is present in the plasma (M_(P)) is calculated as: M_(P)=D×f/2.

The probability of detecting the mutation in the plasma (Pb) at a particular mutation site is calculated as:

$Pb = 1 - \sum_{i = 0}^{r - 1} \mathrm{Poisson}\left( i, M_{P} \right)$

where r (cutoff value) is the number of times that a nucleotide change is seen in the plasma so as to define it as a potential tumor-associated mutation; Poisson(i,M_(P)) is the Poisson distribution probability of having i occurrences with an average number of M_(P).

The total number of cancer-associated mutations expected to be detected in the plasma (N_(P)) can be calculated as: N_(P)=N_(T)×Pb, where N_(T) is the number of mutations present in the tumor tissue. The following graphs show the percentages of tumor-associated mutations expected to be detected in the plasma using different criteria of occurrences (r) for calling a potential mutation and different sequencing depths.
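The Poisson calculation above can be sketched in a few lines. This is a minimal illustration of the stated formulas (M_(P)=D×f/2, Pb, and N_(P)=N_(T)×Pb); the function and argument names are assumptions introduced here and do not come from the source.

```python
import math


def poisson_pmf(i: int, mean: float) -> float:
    """Poisson probability of exactly i occurrences given the mean."""
    return math.exp(-mean) * mean**i / math.factorial(i)


def detection_probability(depth: float, tumor_fraction: float, r: int) -> float:
    """Pb: chance that a heterozygous tumor mutation is seen at least r times."""
    m_p = depth * tumor_fraction / 2          # M_P = D x f / 2
    return 1.0 - sum(poisson_pmf(i, m_p) for i in range(r))


def expected_detected(n_tumor: int, depth: float, tumor_fraction: float, r: int) -> float:
    """N_P = N_T x Pb: expected number of tumor mutations detected in plasma."""
    return n_tumor * detection_probability(depth, tumor_fraction, r)


if __name__ == "__main__":
    for depth, r in [(110, 6), (110, 7), (200, 6), (200, 7)]:
        n_p = expected_detected(3_000, depth, 0.10, r)
        print(f"depth={depth:>3}  r={r}  expected detected = {n_p:,.0f}")
```

With 3,000 tumor mutations and a 10% fractional concentration, this sketch gives roughly 1,410 detected mutations at 110-fold depth with r=6 and roughly 2,600 at 200-fold depth with r=7.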

FIG. 6A is a graph 600 showing the detection rate of cancer-associated mutations in plasma with 10% and 20% plasma fractional concentrations of tumor-derived DNA and using four and six occurrences (r) as criteria for calling potential cancer-associated mutations. With the same r, a higher fractional concentration of tumor-derived DNA in plasma would result in a higher number of cancer-associated mutations detectable in the plasma. With the same fractional concentration of tumor-derived DNA in plasma, a higher r would result in a smaller number of detected mutations.

B. Number of False-Positive Single Nucleotide Changes Detected Due to Errors

Single nucleotide changes in the plasma DNA sequencing data can occur due to sequencing and alignment errors. The number of nucleotide positions with false-positive single nucleotide changes can be predicted mathematically based on a binomial distribution. The parameters affecting the number of false-positive sites (N_(FP)) can include: (1) Sequencing error rate (E): the sequencing error rate is defined as the proportion of sequenced nucleotides that are incorrect; (2) Sequencing depth (D): with a higher sequencing depth, the number of nucleotide positions showing a sequencing error would increase; (3) The minimum number of occurrences of the same nucleotide change for defining a potential cancer-associated mutation (r); and (4) The total number of nucleotide positions within the region-of-interest (N_(I)).

The occurrence of sequencing errors can generally be regarded as a random process. Therefore, as the criterion of occurrence for defining a potential mutation becomes more stringent, the number of false-positive nucleotide positions would decrease exponentially with r. In some of the existing sequencing platforms, certain sequence contexts are more prone to sequencing errors. Examples of such sequence contexts include the GGC motif, homopolymers (e.g. AAAAAAA), and simple repeats (e.g. ATATATATAT). These sequence contexts will substantially increase the number of single nucleotide change or insertion/deletion artifacts (Nakamura K et al. Nucleic Acids Res 2011; 39, e90 and Minoche A E et al. Genome Biol 2011; 12, R112). In addition, repeat sequences, such as homopolymers and simple repeats, would computationally introduce ambiguities in alignment and, hence, lead to false-positive results for single nucleotide variations.

The larger the region-of-interest, the higher the number of false-positive nucleotide positions that would be observed. If one is looking for mutations in the whole genome, then the region-of-interest would be the whole genome and the number of nucleotides involved would be 3 billion. On the other hand, if one focuses on the exons, then the number of nucleotides encoding the exons, i.e. approximately 45 million, would constitute the region-of-interest.

The number of false-positive nucleotide positions associated with sequencing errors can be determined based on the following calculations. The probability (P_(Er)) of having the same nucleotide change at the same position due to sequencing errors can be calculated as:

$P_{Er} = {{C\left( {D,r} \right)}{E\left( \frac{E}{3} \right)}^{r - 1}}$

where C(D, r) is the number of possible combinations for choosing r elements from a total of D elements; r is the number of occurrences for defining a potential mutation; D is the sequencing depth; and E is the sequencing error rate. C(D, r) can be calculated as:

${C\left( {D,r} \right)} = \frac{D!}{{r!}{\left( {D - r} \right)!}}$

The number of nucleotide positions (N_(FP)) being false-positives for mutations can be calculated as:

N_(FP)=N_(I)×P_(Er)

where N_(I) is the total number of nucleotide positions in the region-of-interest.
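The false-positive calculation can be sketched as follows. This is an illustration of the two formulas exactly as stated above (it is not a full binomial tail), and the function names are assumptions introduced here.

```python
import math


def error_probability(depth: int, r: int, error_rate: float) -> float:
    """P_Er = C(D, r) * E * (E / 3)^(r - 1), as given in the text."""
    return math.comb(depth, r) * error_rate * (error_rate / 3) ** (r - 1)


def expected_false_positives(n_positions: int, depth: int, r: int, error_rate: float) -> float:
    """N_FP = N_I * P_Er."""
    return n_positions * error_probability(depth, r, error_rate)


if __name__ == "__main__":
    whole_genome = 3_000_000_000
    for depth, r in [(110, 6), (200, 5), (200, 6), (200, 7)]:
        n_fp = expected_false_positives(whole_genome, depth, r, 0.003)
        print(f"depth={depth:>3}  r={r}  expected false-positive sites = {n_fp:,.1f}")
```

With a 0.3% error rate over 3 billion positions, the sketch gives about 20 false-positive sites at 110-fold depth with r=6 and about 740 at 200-fold depth with r=6.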

FIG. 6B is a graph 650 showing the expected number of nucleotide positions falsely classified as having a nucleotide change using criteria of occurrence (r) of 4, 5, 6 and 7 vs. sequencing depth. The region-of-interest is assumed to be the whole genome (3 billion nucleotide positions) in this calculation. The sequencing error rate is assumed to be 0.3% of the sequenced nucleotides. As one can see, the value of r has a significant impact on the false positives. But, as can be seen from FIG. 6A, a higher value of r also reduces the number of mutations detected, at least until significantly higher sequencing depths are used.

C. Choosing Minimum Occurrence (r)

As discussed above, the numbers of true cancer-associated mutation sites and of false-positive sites due to sequencing errors would both increase with sequencing depth. However, their rates of increase would be different. Therefore, it is possible to make use of the choice of sequencing depth and the value of r to maximize the detection of true cancer-associated mutations while keeping the number of false-positive sites at a low value.

FIG. 7A is a graph 700 showing the number of true cancer-associated mutation sites and false-positive sites with different sequencing depths. The total number of cancer-associated mutations in the tumor tissue is assumed to be 3,000 and the fractional concentration of tumor-derived DNA in the plasma is assumed to be 10%. The sequencing error rate is assumed to be 0.3%. In the legend, TP denotes the true-positive sites at which a corresponding mutation is present in the tumor tissue, and FP denotes false-positive sites at which no corresponding mutation is present in the tumor tissue and the nucleotide changes present in the sequencing data are due to sequencing errors.

From graph 700, at a sequencing depth of 110-fold, approximately 1,410 true cancer-associated mutations would be detected if we use the minimum occurrence of 6 as the criterion (r=6) to define a potential mutation site in the plasma. Using this criterion, only approximately 20 false-positive sites would be detected. If we use the minimum of 7 occurrences (r=7) as the criterion to define a potential mutation, the number of cancer-associated mutations that could be detected would be reduced by 470 to approximately 940. Therefore, the criterion of r=6 would make the detection of cancer-associated mutations in plasma more sensitive.

On the other hand, at a sequencing depth of 200-fold, the number of true cancer-associated mutations detected would be approximately 2,800 and 2,600, if we use the criteria of minimum occurrence (r) of 6 and 7, respectively, to define potential mutations. Using these two values of r, the numbers of false-positive sites would be approximately 740 and 20, respectively. Therefore, at a sequencing depth of 200-fold, the use of a more stringent criterion of r=7 for defining a potential mutation can greatly reduce the number of false-positive sites without significantly adversely affecting the sensitivity for detecting the true cancer-associated mutations.
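The trade-off described for graph 700 can be tabulated with a short sketch that combines the Poisson model for true positives with the error formula for false positives. The constants mirror the assumptions stated for FIG. 7A; the function names are illustrative and not taken from the source.

```python
import math

N_TUMOR = 3_000               # mutations in the tumor tissue
TUMOR_FRACTION = 0.10         # fractional concentration of tumor DNA in plasma
ERROR_RATE = 0.003            # 0.3% sequencing error rate
GENOME = 3_000_000_000        # whole-genome region-of-interest


def true_positives(depth: int, r: int) -> float:
    """Expected tumor mutations seen at least r times (Poisson model)."""
    mean = depth * TUMOR_FRACTION / 2
    below_cutoff = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(r))
    return N_TUMOR * (1 - below_cutoff)


def false_positives(depth: int, r: int) -> float:
    """Expected error-only sites: N_I * C(D, r) * E * (E / 3)^(r - 1)."""
    return GENOME * math.comb(depth, r) * ERROR_RATE * (ERROR_RATE / 3) ** (r - 1)


if __name__ == "__main__":
    print("depth  r    TP      FP")
    for depth in (110, 200):
        for r in (6, 7):
            print(f"{depth:>5}  {r}  {true_positives(depth, r):>5.0f}  {false_positives(depth, r):>6.1f}")
```

Under these assumptions, moving from r=6 to r=7 at 200-fold depth costs about 190 true positives but removes about 720 false positives, which matches the reasoning in the text.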

D. Dynamic Cutoff for Sequencing Data for Defining Potential Mutations in Plasma

The sequencing depth of each nucleotide within the region-of-interest would be different. If we apply a fixed cutoff value for the occurrence of a nucleotide change to define a potential mutation in plasma, the nucleotides that are covered by more sequence reads (i.e. a higher sequencing depth) would have higher probabilities of being falsely labeled, due to sequencing errors, as having a nucleotide variation in the absence of such a change in the tumor tissue, compared with nucleotides that have lower sequencing depths. One embodiment to overcome this problem is to apply a dynamic cutoff value of r to different nucleotide positions according to the actual sequencing depth of the particular nucleotide position and according to the desired upper limit of the probability for calling false-positive variations.

In one embodiment, the maximum allowable false-positive rate can be fixed at 1 in 1.5×10⁸ nucleotide positions. With this maximum allowable false-positive rate, the total number of false-positive sites being identified in the whole genome would be less than 20. The value of r for different sequencing depths can be determined according to the curves shown in FIG. 6B and these cutoffs are shown in Table 1. In other embodiments, other different maximum allowable false-positive rates, e.g. 1 in 3×10⁸, 1 in 10⁸ or 1 in 6×10⁷, can be used. The corresponding total number of false-positive sites would be less than 10, 30 and 50, respectively.

TABLE 1 The minimum number of occurrences of a nucleotide change present in plasma to define a potential mutation (r) for different sequencing depths of the particular nucleotide position. The maximum false-positive rate is fixed at 1 in 1.5 × 10⁸ nucleotides.

Sequencing depth of a            Minimum number of occurrences of a nucleotide change
particular nucleotide position   in the plasma DNA sequencing data to define a
                                 potential mutation (r)
<50                              5
50-110                           6
111-200                          7
201-310                          8
311-450                          9
451-620                          10
621-800                          11
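One way to derive a depth-dependent cutoff is to search for the smallest r whose error probability P_(Er) does not exceed the maximum allowable false-positive rate. The sketch below assumes a 0.3% error rate and the P_(Er) formula given earlier; the function name and defaults are hypothetical. Because Table 1 was read from the curves of FIG. 6B, the values computed here can differ slightly near the bin edges (for example, at a depth of exactly 200 this sketch returns 8 rather than 7).

```python
import math
from typing import Optional


def dynamic_cutoff(depth: int,
                   error_rate: float = 0.003,
                   max_fp_rate: float = 1 / 1.5e8) -> Optional[int]:
    """Smallest r such that C(D, r) * E * (E / 3)^(r - 1) <= max_fp_rate.

    Returns None if no r up to the depth itself satisfies the limit.
    """
    for r in range(1, depth + 1):
        p_er = math.comb(depth, r) * error_rate * (error_rate / 3) ** (r - 1)
        if p_er <= max_fp_rate:
            return r
    return None


if __name__ == "__main__":
    for depth in (40, 110, 111, 300, 450, 800):
        print(f"depth={depth:>3}  cutoff r={dynamic_cutoff(depth)}")
```

A per-position cutoff of this kind would be applied to the actual read depth at each nucleotide position rather than to the average depth of the run.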

E. Target-Enrichment Sequencing

As shown in FIG. 7A, a higher sequencing depth can result in a better sensitivity for detecting cancer-associated mutations while keeping the number of false-positive sites low by allowing the use of a higher value of r. For example, at a sequencing depth of 110-fold, 1,410 true cancer-associated mutations can be detected in the plasma using an r value of 6, whereas the number of true cancer-associated mutations detected would be 2,600 when the sequencing depth increases to 200-fold and an r value of 7 is applied. Both sets of data would give an expected number of false-positive sites of approximately 20.

While the sequencing of the whole genome to a depth of 200-fold is relatively expensive at present, one possible way of achieving such a sequencing depth would be to focus on a smaller region-of-interest. The analysis of a target region can be achieved, for example, by the use of DNA or RNA baits to capture genomic regions of interest by hybridization, although other approaches can also be used. The captured regions are then pulled down, e.g., by magnetic means, and then subjected to sequencing. Such target capture can be performed, for example, using the Agilent SureSelect target enrichment system, the Roche Nimblegen target enrichment system and the Illumina targeted resequencing system. Another approach is to perform PCR amplification of the target regions and then perform sequencing. In one embodiment, the region-of-interest is the exome. In such an embodiment, target capturing of all exons can be performed on the plasma DNA, and the plasma DNA enriched for exonic regions can then be sequenced.

In addition to having higher sequencing depth, the focus on specific regions instead of analyzing the whole genome would significantly reduce the number of nucleotide positions in the search space and would lead to a reduction in the number of false-positive sites given the same sequencing error rate.

FIG. 7B is a graph 750 showing the predicted number of false-positive sites involving the analysis of the whole genome (WG) and all exons. For each type of analysis, two different values, 5 and 6, for r are used. At a sequencing depth of 200-fold, if r=5 is used to define mutations in plasma, the predicted numbers of false-positive sites are approximately 23,000 and 230 for the whole genome and all exons, respectively. If r=6 is used to define mutations in plasma, the predicted numbers of false-positive sites are approximately 750 and 7, respectively. Therefore, limiting the number of nucleotides in the region-of-interest can significantly reduce the number of false-positives in plasma mutational analysis.

In exon-capture or even exome-capture sequencing, the number of nucleotides in the search space is reduced. Therefore, even if we allow a higher false-positive rate for the detection of cancer-associated mutations, the absolute number of false-positive sites can be kept at a relatively low level. The allowance of a higher false-positive rate would allow a less stringent criterion of minimum occurrences (r) for defining a single nucleotide variation in plasma to be used. This would result in a higher sensitivity for the detection of true cancer-associated mutations.

In one embodiment, we can use a maximum allowable false-positive rate of 1 in 1.5×10⁶. With this false-positive rate, the total number of false-positive sites within the targeted exons would only be 20. The values of r for different sequencing depths using a maximum allowable false-positive rate of 1 in 1.5×10⁶ are shown in Table 2. In other embodiments, other different maximum allowable false-positive rates, e.g. 1 in 3×10⁶, 1 in 10⁶ or 1 in 6×10⁵, can be used. The corresponding total number of false-positive sites would be less than 10, 30 and 50, respectively. In one embodiment, different classes of mutations can be attributed different weightings, as described above.

TABLE 2 The minimum number of occurrences of a nucleotide change present in plasma to define a potential mutation (r) for different sequencing depths of the particular nucleotide position. The maximum false-positive rate is fixed at 1 in 1.5 × 10⁶ nucleotides.

Sequencing depth of a            Minimum number of occurrences of a nucleotide change
particular nucleotide position   in the plasma DNA sequencing data to define a
                                 potential mutation (r)
<50                              4
50-125                           5
126-235                          6
236-380                          7
381-560                          8
561-760                          9

VIII. Cancer Detection

As mentioned above, the counts of sequence tags at variant loci can be used in various ways to determine the parameter, which is compared to a threshold to classify a level of cancer.

The fractional concentration of variant reads relative to all reads at a locus or many loci is another parameter that may be used. Below are some examples of calculating the parameter and the threshold.

A. Determination of Parameter

If the CG is homozygous at a particular locus for a first allele and a variant allele is seen in the biological sample (e.g., plasma), then the fractional concentration can be calculated as 2p/(p+q), where p is the number of sequence tags having the variant allele and q is the number of sequence tags having the first allele of the CG. This formula assumes that only one of the haplotypes of the tumor has the variant, which would typically be the case. Thus, for each homozygous locus a fractional concentration can be calculated. The fractional concentrations can be averaged. In another embodiment, the count p can include the number of sequence tags for all of the loci, and similarly for the count q, to determine the fractional concentration. An example is now described.
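The two ways of combining loci described above (averaging per-locus values versus pooling the counts) can be sketched as follows. The read counts are made-up illustrative numbers, not data from the source, and the function names are assumptions.

```python
from typing import List, Tuple

# Each locus is (p, q): variant-allele tags and first-allele (CG) tags.
Locus = Tuple[int, int]


def fractional_concentration(p: int, q: int) -> float:
    """2p / (p + q), assuming only one tumor haplotype carries the variant."""
    if p + q == 0:
        raise ValueError("locus has no sequence tags")
    return 2 * p / (p + q)


def mean_per_locus(loci: List[Locus]) -> float:
    """Average of the fractional concentrations computed locus by locus."""
    return sum(fractional_concentration(p, q) for p, q in loci) / len(loci)


def pooled(loci: List[Locus]) -> float:
    """Fractional concentration from counts summed over all loci."""
    total_p = sum(p for p, _ in loci)
    total_q = sum(q for _, q in loci)
    return fractional_concentration(total_p, total_q)


if __name__ == "__main__":
    example_loci = [(4, 36), (5, 95), (2, 78)]   # hypothetical counts
    print(f"mean of per-locus values: {mean_per_locus(example_loci):.4f}")
    print(f"pooled over all loci:     {pooled(example_loci):.4f}")
```

The pooled estimate weights each locus by its read depth, whereas the per-locus average weights all loci equally, so the two can differ when depth varies between loci.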

The genomewide detection of tumor-derived single nucleotide variants (SNVs) in the plasma of the 4 HCC patients was explored. We sequenced tumor DNA and buffy coat DNA to mean depths of 29.5-fold (range, 27-fold to 33-fold) and 43-fold (range, 39-fold to 46-fold) haploid genome coverage, respectively. The MPS data from the tumor DNA and the buffy coat DNA from each of the 4 HCC patients were compared, and SNVs present in the tumor DNA but not in the buffy coat DNA were mined with a stringent bioinformatics algorithm. This algorithm required a putative SNV to be present in at least a threshold number of sequenced tumor DNA fragments (i.e. in a corresponding sequenced tag) before it would be classified as a true SNV. The threshold number was determined by taking into account the sequencing depth of a particular nucleotide and the sequencing error rate, e.g., as described herein.

FIG. 8 is a table 800 showing results for 4 HCC patients before and after treatment, including fractional concentrations of tumor-derived DNA in plasma according to embodiments of the present invention. The number of tumor-associated SNVs ranged from 1,334 to 3,171 in the 4 HCC cases. The proportions of such SNVs that were detectable in plasma are listed before and after treatment. Before treatment, 15%-94% of the tumor-associated SNVs were detected in plasma. After treatment, the percentage was between 1.5% and 5.5%. Thus, the number of detected SNVs does correlate to a level of cancer. This shows that the number of SNVs can be used as a parameter to classify a level of cancer.

The fractional concentrations of tumor-derived DNA in plasma were determined by the fractional counts of the mutant with respect to the total (i.e., mutant plus wild type) sequences. The formula is 2p/(p+q), where the 2 accounts for just one haplotype being mutated on the tumor. These fractional concentrations were well correlated with those determined with genomewide aggregated allelic loss (GAAL) analysis (Chan K C et al. Clin Chem 2013; 59:211-24) and were reduced after surgery. Thus, the fractional concentration is also shown to be a usable parameter for determining a level of cancer.

The fractional concentration from the SNV analysis can convey a tumor load. A cancer patient with a higher tumor load (e.g., a higher deduced fractional concentration) will have a higher frequency of somatic mutations than one with a lower tumor load. Thus, embodiments can also be used for prognostication. In general, cancer patients with higher tumor loads have a worse prognosis than those with lower tumor loads. The former group would thus have a higher chance of dying from the disease. In some embodiments, if the absolute concentration of DNA in a biological sample, e.g. plasma, can be determined (e.g. using real-time PCR or fluorometry), then the absolute concentration of tumor-associated genetic aberrations can be determined and used for clinical detection and/or monitoring and/or prognostication.

B. Determination of Threshold

Table 800 may be used to determine a threshold. As mentioned above, the number of SNVs and a fractional concentration determined by SNV analysis correlate to a level of cancer. The threshold can be determined on an individual basis. For example, the pre-treatment value can be used to determine the threshold. In various implementations, the threshold could be a relative change from the pre-treatment value or an absolute value. A suitable threshold could be a reduction in the number of SNVs or in the fractional concentration by 50%. Such a threshold would provide a classification of a lower level of cancer for each of the cases in table 800. Note that such a threshold may be dependent on the sequencing depth.

In one embodiment, a threshold could be used across samples, and may or may not account for pre-treatment values for the parameter. For example, a threshold of 100 SNVs could be used to classify the subject as having no cancer or a low level of cancer. This threshold of 100 SNVs is satisfied by each of the four cases in table 800. If the fractional concentration was used as the parameter, a threshold of 1.0% would classify HCC1-HCC3 as practically zero level of cancer, and a second threshold of 1.5% would classify HCC4 as a low level of cancer. Thus, more than one threshold may be used to obtain more than two classifications.
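The use of more than one threshold to obtain more than two classifications can be sketched with a sorted threshold list. The 1.0% and 1.5% thresholds are the example values from the preceding paragraph; the label for values at or above 1.5%, the handling of values exactly at a threshold, and the function name are assumptions made for illustration.

```python
from bisect import bisect_right

# Example thresholds on the fractional concentration (as fractions, not percent).
THRESHOLDS = [0.010, 0.015]
# One more label than thresholds; the last label is a hypothetical placeholder.
LEVELS = ["practically zero level", "low level", "higher level"]


def classify_level(fractional_concentration: float) -> str:
    """Map a fractional concentration to a level of cancer.

    A value exactly equal to a threshold is assigned to the level above it.
    """
    return LEVELS[bisect_right(THRESHOLDS, fractional_concentration)]


if __name__ == "__main__":
    for value in (0.004, 0.012, 0.200):   # hypothetical sample values
        print(f"{value:.1%} -> {classify_level(value)}")
```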

To illustrate other possible thresholds, we analyzed the plasma of the healthy controls for the tumor-associated SNVs. Numerous measurements can be made of healthy subjects to determine a range of how many variations are expected from the biological sample relative to the constitutional genome.

FIG. 9 is a table 900 showing detection of the HCC-associated SNVs in 16 healthy control subjects according to embodiments of the present invention. Table 900 can be used to estimate the specificity of an SNV analysis approach. The 16 healthy controls are listed as different rows. The columns investigate the SNVs detected for the specific HCC patients, and show the number of sequence reads at variant loci having the variant allele and the number of sequence reads with the wildtype allele (i.e., the allele from the CG). For example, for HCC1, control C01 had 40 variant reads at such variant loci, but 31,261 reads of the wildtype allele. The last column shows the total fractional concentration across all of the SNVs for the HCC1 patients. As the HCC-associated SNVs were specific for the HCC patients, the presence of the HCC-associated SNVs in the controls represents false-positives. If cutoff values, as described herein, are applied to these apparent sequence variants, all of these false-positives would be filtered away.

The presence of a small number of these putative tumor-associated mutations in the plasma of the 16 healthy controls represented the “stochastic noise” of this method and was likely due to sequencing errors. The mean fractional concentration estimated from such noise was 0.38%. These values show a range for healthy subjects. Thus, a threshold value for a classification of zero level of cancer for HCC could be about 0.5%, since the highest fractional concentration was 0.43%. Thus, if all the cancer cells are removed from an HCC patient, these low fractional concentrations would be expected.

Referring back to table 800, if 0.5% was used as a threshold for zero level of cancer, then the post-treatment plasma data for HCC1 and HCC3 would be determined as having zero level based on the SNV analysis. HCC2 might be classified as one level up from zero. HCC4 might also be classified as one level up from zero, or some higher level, but still a relatively low level compared to the pre-treatment samples.

In one embodiment where the parameter corresponds to the number of variant loci, the threshold could be zero (i.e., one variant locus could indicate a non-zero level of cancer). However, with many settings (e.g., of depth), the threshold would be higher, e.g., an absolute value of 5 or 10. In one implementation where a person is monitored after treatment, the threshold can be a certain percentage of SNVs (identified by analyzing the tumors directly) showing up in the sample. If the cutoff value for the number of variant reads required at a locus was large enough, just having one variant locus might be indicative of a non-zero level of cancer.

Thus, quantitative analysis of variations (e.g., single nucleotide variations) in DNA from a biological sample (e.g., plasma) can be used for the diagnosis, monitoring and prognostication of cancer. For the detection of cancer, the number of single nucleotide variations detected in the plasma of a tested subject can be compared with that of a group of healthy subjects. In the healthy subjects, the apparent single nucleotide variations in plasma can be due to sequencing errors and to non-clonal mutations from the blood cells and other organs. It has been shown that the cells in normal healthy subjects could carry a small number of mutations (Conrad D F et al. Nat Genet 2011; 43:712-4), as shown in table 900. Thus, the overall number of apparent single nucleotide variations in the plasma of a group of apparently healthy subjects can be used as a reference range to determine if the tested patient has an abnormally high number of single nucleotide variations in plasma corresponding to a non-zero level of cancer.

The healthy subjects used for establishing the reference range can be matched to the tested subject in terms of age and sex. In a previous study, it has been shown that the number of mutations in the somatic cells would increase with age (Cheung N K et al, JAMA 2012; 307:1062-71). Thus, as we grow older, it would be ‘normal’ for one to accumulate clones of cells, even though they are relatively benign most of the time, or would take a very long time to become clinically significant. In one embodiment, reference levels can be generated for different subject groups, e.g. different age, sex, ethnicity, and other parameters (e.g. smoking status, hepatitis status, alcohol, drug history).

The reference range can vary based on the cutoff value used (i.e., the number of variant sequence tags required at a locus), as well as the assumed false positive rate and other variables (e.g., age). Thus, the reference range may be determined for a particular set of one or more criteria, and the same criteria would be used to determine a parameter for a sample. Then, the parameter can be compared to the reference range, since both were determined using the same criteria.

As mentioned above, embodiments may use multiple thresholds for determining a level of cancer. For example, a first threshold could indicate no signs of cancer for parameters below the threshold, and at least a first level of cancer, which could be a pre-neoplastic level, for parameters above it. Other levels could correspond to different stages of cancer.

C. Dependency on Experimental Variables

The depth of sequencing can be important for establishing the minimum detection threshold of the minority (e.g. tumor) genome. For example, if one uses a sequencing depth of 10 haploid genomes, then the minimum tumoral DNA concentration that one could detect even with a sequencing technology without any error is ⅕, i.e. 20%. On the other hand, if one uses a sequencing depth of 100 haploid genomes, then one could go down to 2%. This analysis refers to the scenario in which only one mutation locus is being analyzed. However, when more mutation loci are analyzed, the minimum tumoral DNA concentration can be lower and is governed by a binomial probability function. For example, if the sequencing depth is 10-fold and the fractional concentration of tumoral DNA is 20%, then the chance of detecting the mutation is 10%. However, if we have 10 mutations, then the chance of detecting at least one mutation would be 1−(1−10%)¹⁰=65%.
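The two calculations in the preceding paragraph can be expressed as a small sketch. It assumes, as the text does, that a single mutant read suffices for detection and that the mutation lies on one of the two homologous chromosomes; the function names are hypothetical.

```python
def min_detectable_fraction(depth: int) -> float:
    """Lowest tumoral DNA fraction giving one mutant read at this depth.

    One mutant read in `depth` reads is 1/depth of the reads; the tumor
    fraction is twice that because only one haplotype carries the mutation.
    """
    return 2 / depth


def chance_of_at_least_one(per_locus_chance: float, n_loci: int) -> float:
    """Probability that at least one of n independent loci is detected."""
    return 1 - (1 - per_locus_chance) ** n_loci


if __name__ == "__main__":
    print(f"depth 10:  minimum fraction = {min_detectable_fraction(10):.0%}")
    print(f"depth 100: minimum fraction = {min_detectable_fraction(100):.0%}")
    print(f"10 loci at 10% each: {chance_of_at_least_one(0.10, 10):.0%}")
```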

Increasing the sequencing depth has several effects. The higher the sequencing depth, the more sequencing errors would be seen (see FIGS. 4 and 5). However, with a higher sequencing depth, one can more easily differentiate sequencing errors from mutations due to clonal expansion of a subpopulation of cells (e.g. cancer cells), because the sequencing errors would occur randomly in the genome but the mutations would occur at the same location for the given population of cells.

The higher the sequencing depth, the more mutations from the “healthy cells” would be identified. However, when there is no clonal expansion of these healthy cells and their mutational profiles are different, the mutations in these healthy cells can be differentiated from the mutations of clonally expanded cells (e.g. cancer cells) by their frequencies of occurrence in the plasma (e.g., by using a cutoff N for a required number of reads exhibiting the mutation, such as having N equal to 2, 3, 4, 5, or larger).

As mentioned above, the threshold can depend on an amount of mutations in healthy cells that would be clonally expanded, and thus might not be filtered out through other mechanisms. This variance that one would expect can be obtained by analyzing healthy subjects. As clonal expansion occurs over time, the age of the patient can affect the variance that one sees in healthy subjects, and thus the threshold can be dependent on age.

D. Combination with Targeted Approaches

In some embodiments, random sequencing can be used in combination with targeted approaches. For example, one can perform random sequencing of a plasma sample upon presentation of a cancer patient. The sequencing data of plasma DNA can be analyzed for copy number aberrations and SNVs. The regions showing aberrations (e.g., amplification/deletion or high density of SNVs) can be targeted for serial monitoring purposes. The monitoring can be done over a period of time, or done immediately after the random sequencing, effectively as a single procedure. For the targeted analysis, solution-phase hybridization-based capture approaches have been successfully used to enrich plasma DNA for noninvasive prenatal diagnosis (Liao G J et al. Clin Chem 2011; 57:92-101). Such techniques are mentioned above. Thus, the targeted and random approaches can be used in combination for cancer detection and monitoring.

Thus, one could perform targeted sequencing of the loci that are found to be potentially mutated using the non-targeted, genomewide approach mentioned above. Such targeted sequencing could be performed using solution- or solid-phase hybridization techniques (e.g. using the Agilent SureSelect, NimbleGen Sequence Capture, or Illumina targeted resequencing system) followed by massively parallel sequencing. Another approach is to use an amplification-based (e.g. PCR-based) system for targeted sequencing (Forshew T et al. Sci Transl Med 2012; 4: 136ra68).

IX. Fractional Concentration

The fractional concentration of tumor DNA can be used to determine the cutoff value for the required number of variations at a locus before the locus is identified as a mutation. For example, if the fractional concentration was known to be relatively high, then a high cutoff could be used to filter out more false positives, since one knows that a relatively high number of variant reads should exist for true SNVs. On the other hand, if the fractional concentration was low, then a lower cutoff might be needed so that some SNVs are not missed. In this case, the fractional concentration would be determined by a different method than the SNV analysis, where it is used as a parameter.

Various techniques may be used for determining the fractional concentration, some of which are described herein. These techniques can be used to determine the fractional concentration of tumor-derived DNA in a mixture, e.g. a biopsy sample containing a mixture of tumor cells and nonmalignant cells, or a plasma sample from a cancer patient containing DNA released from tumor cells and DNA released from nonmalignant cells.

A. GAAL

Genomewide aggregated allelic loss (GAAL) analyzes loci that have lost heterozygosity (Chan K C et al. Clin Chem 2013; 59:211-24). For a site of the constitutional genome CG that is heterozygous, a tumor often has a locus that has a deletion of one of the alleles. Thus, the sequence reads for such a locus will show more of one allele than another, where the difference is proportional to the fractional concentration of tumor DNA in the sample. An example of such a calculation follows.

DNA extracted from the buffy coat and the tumor tissues of the HCC patients was genotyped with the Affymetrix Genome-Wide Human SNP Array 6.0 system. The microarray data were processed with the Affymetrix Genotyping Console version 4.1. Genotyping analysis and single-nucleotide polymorphism (SNP) calling were performed with the Birdseed v2 algorithm. The genotyping data for the buffy coat and the tumor tissues were used for identifying loss-of-heterozygosity (LOH) regions and for performing copy number analysis. Copy number analysis was performed with the Genotyping Console with default parameters from Affymetrix and with a minimum genomic-segment size of 100 bp and a minimum of 5 genetic markers within the segment.

Regions with LOH were identified as regions having 1 copy in the tumor tissue and 2 copies in the buffy coat, with the SNPs within these regions being heterozygous in the buffy coat but homozygous in the tumor tissue. For a genomic region exhibiting LOH in a tumor tissue, the SNP alleles that were present in the buffy coat but were absent from or of reduced intensity in the tumor tissues were considered to be the alleles on the deleted segment of the chromosomal region. The alleles that were present in both the buffy coat and the tumor tissue were deemed as having been derived from the non-deleted segment of the chromosomal region. For all the chromosomal regions with a single copy loss in the tumor, the total numbers of sequence reads carrying the deleted alleles and the non-deleted alleles were counted. The difference of these two values was used to infer the fractional concentration of tumor-derived DNA (F_(GAAL)) in the sample using the following equation:

$F_{GAAL} = \frac{N_{non\text{-}del} - N_{del}}{N_{non\text{-}del}}$

where N_(non-del) represents the total number of sequence reads carrying the non-deleted alleles and N_(del) represents the total number of sequence reads carrying the deleted alleles.
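A minimal sketch of this calculation follows. The read counts are hypothetical and the function name is illustrative. The intuition, under the single-copy-loss assumption stated above, is that reads carrying the deleted allele come only from non-tumor DNA, while reads carrying the non-deleted allele come from both tumor and non-tumor DNA.

```python
def f_gaal(n_non_del: int, n_del: int) -> float:
    """Fractional tumor DNA concentration from aggregated allelic counts
    over regions with a single copy loss in the tumor."""
    if n_non_del <= 0:
        raise ValueError("n_non_del must be positive")
    return (n_non_del - n_del) / n_non_del


# Hypothetical counts: 1,000 reads with non-deleted alleles, 800 with deleted alleles.
print(f_gaal(1000, 800))  # 0.2
```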

B. Estimation Using Genomic Representation

A problem with the GAAL technique is that particular loci (i.e. ones exhibiting LOH) are identified and only sequence reads aligning to such loci are used. Such a requirement can add additional steps, and thus costs. An embodiment is now described which uses only copy number, e.g., a sequence read density.

Chromosomal aberrations, for example amplifications and deletions, are frequently observed in cancer genomes. The chromosomal aberrations observed in cancer tissues typically involve subchromosomal regions, and these aberrations can be shorter than 1 Mb. In addition, the cancer-associated chromosomal aberrations are heterogeneous in different patients, and thus different regions may be affected in different patients. It is also not uncommon for tens, hundreds or even thousands of copy number aberrations to be found in a cancer genome. All of these factors make determining tumor DNA concentration difficult.

Embodiments involve the analysis of quantitative changes resulting from tumor-associated chromosomal aberrations. In one embodiment, the DNA samples containing DNA derived from cancer cells and normal cells are sequenced using massively parallel sequencing, for example, by the Illumina HiSeq2000 sequencing platform. The DNA may be cell-free DNA in plasma or another suitable biological sample.

Chromosomal regions that are amplified in the tumor tissues would have an increased probability of being sequenced, and regions that are deleted in the tumor tissues would have a reduced probability of being sequenced. As a result, the density of sequence reads aligning to the amplified regions would be increased and that aligning to the deleted regions would be reduced. The degree of variation is proportional to the fractional concentration of the tumor-derived DNA in the DNA mixture. The higher the proportion of DNA from the tumor tissue, the larger the change caused by the chromosomal aberrations would be.

1. Estimation in Sample with High Tumor Concentration

DNA was extracted from the tumor tissues of four hepatocellular carcinoma patients. The DNA was fragmented using the Covaris DNA sonication system and sequenced using the Illumina HiSeq2000 platform as described (Chan K C et al. Clin Chem 2013; 59:211-24). The sequence reads were aligned to the human reference genome (hg18). The genome was then divided into 1 Mb bins (regions) and the sequence read density was calculated for each bin after adjustment for GC-bias as described (Chen E Z et al. PLoS One. 2011; 6:e21791).

After sequence reads are aligned to a reference genome, a sequence read density can be computed for various regions. In one embodiment, the sequence read density is a proportion determined as the number of reads mapped to a particular bin (e.g., 1 Mb region) divided by the total sequence reads that can be aligned to the reference genome (e.g., to a unique position in the reference genome). Bins that overlap with chromosomal regions amplified in the tumor tissue are expected to have higher sequence read densities than those from bins without such overlaps. On the other hand, bins that overlap with chromosomal regions that are deleted are expected to have lower sequence read densities than those without such overlaps. The magnitude of the difference in sequence read densities between regions with and without chromosomal aberrations is mainly affected by the proportion of tumor-derived DNA in the sample and the degree of amplification/deletion in the tumor cells.
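As an illustration of the proportion described in this embodiment, the following sketch computes a sequence read density per bin. It assumes the input is a list of (chromosome, position) pairs for reads that have already been uniquely aligned; GC-bias adjustment is not included, and all names are illustrative.

```python
from collections import Counter


def read_densities(positions, bin_size=1_000_000):
    """Return {(chrom, bin_index): proportion of aligned reads in that bin}.

    positions: iterable of (chrom, pos) for uniquely aligned reads.
    """
    counts = Counter((chrom, pos // bin_size) for chrom, pos in positions)
    total = sum(counts.values())
    return {bin_id: n / total for bin_id, n in counts.items()}


# Hypothetical alignment positions.
reads = [("chr1", 10), ("chr1", 999_999), ("chr1", 1_000_000), ("chr2", 5)]
print(read_densities(reads))
```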

Various statistical models may be used to identify the bins having sequence read densities corresponding to different types of chromosomal aberrations. In one embodiment, a normal mixture model (McLachlan G and Peel D. Multivariate normal mixtures. In Finite mixture models 2004: p 81-116. John Wiley & Sons Press) can be used. Other statistical models, for example the binomial mixture model and Poisson regression model (McLachlan G and Peel D. Mixtures with non-normal components, Finite mixture models 2004: p 135-174. John Wiley & Sons Press), can also be used.

The sequence read density for a bin can be normalized using the sequence read density of the same bin as determined from the sequencing of the buffy coat DNA. The sequence read densities of different bins may be affected by the sequence context of a particular chromosomal region, and thus the normalization can help to more accurately identify regions showing aberration. For example, the mappability (which refers to the probability of aligning a sequence back to its original position) of different chromosomal regions can be different. In addition, the polymorphism of copy number (i.e. copy number variations) would also affect the sequence read densities of the bins. Therefore, normalization with the buffy coat DNA can potentially minimize the variations associated with the difference in the sequence context between different chromosomal regions.

FIG. 10A shows a distribution plot 1000 of the sequence read densities of the tumor sample of an HCC patient according to embodiments of the present invention. The tumor tissue was obtained following surgical resection from the HCC patient. The x-axis represents the log₂ of the ratio (R) of the sequence read density between the tumor tissue and the buffy coat of the patient. The y-axis represents the number of bins.

Peaks can be fitted to the distribution curve to represent the regions with deletion, amplification, and without chromosomal aberrations using the normal mixture model. In one embodiment, the number of peaks can be determined by Akaike's information criterion (AIC) across different plausible values. The central peak with a log₂ R=0 (i.e. R=1) represents the regions without any chromosomal aberration. The left peak (relative to the central one) represents regions with one copy loss. The right peak (relative to the central one) represents regions with one copy amplification.

The fractional concentration of tumor-derived DNA can be reflected by the distance between the peaks representing the amplified and deleted regions. The larger the distance, the higher the fractional concentration of the tumor-derived DNA in the sample would be. The fractional concentration of tumor-derived DNA in the sample can be determined by this genomic representation approach, denoted as F_(GR), using the following equation: F_(GR)=R_(right)−R_(left), where R_(right) is the R value of the right peak and R_(left) is the R value of the left peak. The largest difference would be 1, corresponding to 100%. The fractional concentration of tumor-derived DNA in the tumor sample obtained from the HCC patient is estimated to be 66%, where the values of R_(right) and R_(left) are 1.376 and 0.712, respectively.
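The estimate quoted for this patient can be reproduced with a short sketch. The helper that starts from log₂ peak positions is an added convenience, on the assumption that the peaks are read off the log₂ axis of FIG. 10A; the function names are illustrative. The stated maximum of 1 is consistent with a one-copy gain giving R=1.5 and a one-copy loss giving R=0.5 in a sample that is entirely tumor DNA.

```python
import math


def f_gr(r_right: float, r_left: float) -> float:
    """F_GR = R_right - R_left, with R values on the linear (not log2) scale."""
    return r_right - r_left


def f_gr_from_log2(log2_right: float, log2_left: float) -> float:
    """Same estimate when the peak positions are given as log2(R)."""
    return f_gr(2.0 ** log2_right, 2.0 ** log2_left)


print(round(f_gr(1.376, 0.712), 3))  # 0.664
print(round(f_gr_from_log2(math.log2(1.376), math.log2(0.712)), 3))  # 0.664
```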

To verify this result, another method using the genomewide aggregated allelic loss (GAAL) analysis was also used to independently determine the fractional concentration of tumoral DNA (Chan K C et al. Clin Chem 2013; 59:211-24). Table 3 shows the fractional concentrations of tumor-derived DNA in the tumor tissues of the four HCC patients using the genomic representation (F_(GR)) and the GAAL (F_(GAAL)) approaches. The values determined by these two different approaches agree well with each other.

TABLE 3 showing fractional concentration determined by GAAL and genomic representation (GR).

HCC tumor    F_(GAAL)    F_(GR)
1            60.0%       66.5%
2            60.0%       61.4%
3            58.0%       58.9%
4            45.7%       42.2%

2. Estimation in Sample with Low Tumor Concentration

The above analysis has shown that our genomic representation method can be used to measure the fractional concentration of tumor DNA when more than 50% of the sample DNA is tumor-derived, i.e. when the tumor DNA is a majority proportion. This method can also be applied to samples in which the tumor-derived DNA represents a minor proportion (i.e., below 50%). Samples that may contain a minor proportion of tumor DNA include, but are not limited to, blood, plasma, serum, urine, pleural fluid, cerebrospinal fluid, tears, saliva, ascitic fluid and feces of cancer patients. In some samples, the fractional concentration of tumor-derived DNA can be 49%, 40%, 30%, 20%, 10%, 5%, 2%, 1%, 0.5%, 0.1% or lower.

For such samples, the peaks of sequence read density representing the regions with amplification and deletion may not be as obvious as in samples containing a relatively high concentration of tumor-derived DNA as illustrated above. In one embodiment, the regions with chromosomal aberrations in the cancer cells can be identified by making comparison to reference samples which are known to not contain cancer DNA. For example, the plasma of subjects without a cancer can be used as references to determine the normative range of sequence read densities for the chromosome regions. The sequence read density of the tested subject can be compared with the value of the reference group. In one embodiment, the mean and standard deviation (SD) of sequence read density can be determined. For each bin, the sequence read density of the tested subject is compared with the mean of the reference group to determine the z-score using the following formula:

$\text{z-score} = \frac{GR_{test} - \overline{GR}_{ref}}{SD_{ref}},$

where GR_(test) represents the sequence read density of the cancer patient; $\overline{GR}_{ref}$ represents the mean sequence read density of the reference subjects; and SD_(ref) represents the SD of the sequence read densities for the reference subjects.

Regions with z-score<−3 signify significant underrepresentation of the sequence read density for a particular bin in the cancer patient, suggesting the presence of a deletion in the tumor tissue. Regions with z-score>3 signify significant overrepresentation of the sequence read density for a particular bin in the cancer patient, suggesting the presence of an amplification in the tumor tissue.
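The per-bin comparison can be sketched as follows. The reference densities are hypothetical, the function names are illustrative, and the use of the sample standard deviation (rather than the population form) is an assumption, since the text does not specify which is used.

```python
from statistics import mean, stdev


def bin_z_score(gr_test: float, gr_refs: list) -> float:
    """z-score of the tested subject's read density for one bin against
    the densities of the same bin in the reference subjects."""
    return (gr_test - mean(gr_refs)) / stdev(gr_refs)  # sample SD (assumption)


def classify_bin(z: float, threshold: float = 3.0) -> str:
    """Label a bin using the +/-3 z-score cutoffs described in the text."""
    if z < -threshold:
        return "loss"
    if z > threshold:
        return "gain"
    return "no change"


refs = [0.98, 1.00, 1.02]  # hypothetical reference densities for one bin
for test_density in (0.90, 1.01, 1.10):
    z = bin_z_score(test_density, refs)
    print(test_density, classify_bin(z))
```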

Then, the distribution of the z-scores of all the bins can be constructed to identify regions with different numbers of copy gain and loss, for example, deletion of 1 or 2 copies of a chromosome, and amplification resulting in 1, 2, 3 or 4 additional copies of a chromosome. In some cases, more than one chromosome or more than one region of a chromosome may be involved.

FIG. 10B shows a distribution plot 1050 of z-scores for all the bins in the plasma of an HCC patient according to embodiments of the present invention. The peaks (from left to right) representing 1-copy loss, no copy change, 1-copy gain and 2-copy gain are fitted to the z-score distribution. Regions with different types of chromosomal aberrations can then be identified, for example using the normal mixture model as described above.

The fractional concentration of the cancer DNA in the sample (F) can then be inferred from the sequence read densities of the bins that exhibit one-copy gain or one-copy loss. The fractional concentration determined for a particular bin can be calculated as

$F = \frac{\left| GR_{test} - \overline{GR}_{ref} \right| \times 2}{\overline{GR}_{ref}} \times 100\%.$

This can also be expressed as:

$F = \frac{\left| \text{z-score} \right| \times SD_{ref}}{\overline{GR}_{ref}} \times 2,$

which can be rewritten as: F=|z-score|×CV×2, where CV is the coefficient of variation for the measurement of the sequence read density of the reference subjects; and

$CV = \frac{SD_{ref}}{\overline{GR}_{ref}}.$
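The equivalence of the density form and the z-score form can be illustrated with a small sketch. The numbers are hypothetical and the function names are illustrative only.

```python
def fraction_from_density(gr_test: float, mean_ref: float) -> float:
    """F = |GR_test - mean(GR_ref)| x 2 / mean(GR_ref), as a fraction."""
    return abs(gr_test - mean_ref) * 2 / mean_ref


def fraction_from_z(z_score: float, sd_ref: float, mean_ref: float) -> float:
    """F = |z-score| x CV x 2, with CV = SD_ref / mean(GR_ref)."""
    cv = sd_ref / mean_ref
    return abs(z_score) * cv * 2


# Hypothetical bin with a 1-copy gain: density 1.05 against a reference
# mean of 1.00 and SD of 0.01, i.e. a z-score of 5.
mean_ref, sd_ref, gr_test = 1.00, 0.01, 1.05
z = (gr_test - mean_ref) / sd_ref
print(fraction_from_density(gr_test, mean_ref))
print(fraction_from_z(z, sd_ref, mean_ref))
```

Both calls give approximately 0.10, i.e. a tumor DNA fraction of about 10% for this hypothetical bin.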

In one embodiment, the results from the bins are combined. For example, the z-scores of bins showing a 1-copy gain can be averaged or the resulting F values averaged. In another implementation, the value of the z-score used for inferring F is determined by a statistical model and is represented by the peaks shown in FIG. 10B and FIG. 11. For example, the z-score of the right peak can be used to determine the fractional concentration for the regions exhibiting 1-copy gain.

In another embodiment, all bins with z-score<−3 and z-score>3 can be attributed to regions with single copy loss and single copy gain, respectively, because these two types of chromosomal aberrations are the most common. This approximation is most useful when the number of bins with chromosomal aberrations is relatively small and fitting of normal distributions may not be accurate.

FIG. 11 shows a distribution plot 1100 of z-scores for the plasma of an HCC patient according to embodiments of the present invention. While the number of bins overlapping with chromosomal aberrations is relatively small, all bins with z-score<−3 and z-score>3 were fitted to the normal distributions of single copy loss and single copy gain, respectively.

The fractional concentrations of tumor-derived DNA in the plasma of the four HCC patients were determined using GAAL analysis and this GR-based approach. The results are shown in Table 4. As can be seen, the fractional concentrations deduced by the GAAL analysis and the GR analysis correlate well.

TABLE 4 Fractional concentration of tumor-derived DNA in plasma deduced by the analysis of chromosomal aberrations.

             Fractional concentration of tumor-derived DNA in plasma
Samples      GAAL analysis    GR analysis
case11       4.3%             4.5%
case13       5%               5.5%
case23       52%              62%
case27       7.6%             6.1%

C. Method of Determining Fractional Concentration

FIG. 12 is a flowchart of a method 1200 of determining a fractional concentration of tumor DNA in a biological sample including cell-free DNA according to embodiments of the present invention. Method 1200 may be performed via various embodiments, including embodiments described above.

At block 1210, one or more sequence tags are received for each of a plurality of DNA fragments in the biological sample. Block 1210 may be performed as described herein for other methods. For example, one end of a DNA fragment may be sequenced from a plasma sample. In another embodiment, both ends of a DNA fragment may be sequenced, thereby allowing a length of the fragment to be estimated.

At block 1220, genomic positions are determined for the sequence tags. The genomic positions can be determined, e.g., as described herein by aligning the sequence tags to a reference genome. If both ends of a fragment are sequenced, then the paired tags may be aligned as a pair with a distance between the two tags constrained to be less than a specified distance, e.g., 500 or 1,000 bases.

At block 1230, for each of a plurality of genomic regions, a respective amount of DNA fragments within the genomic region is determined from sequence tags having a genomic position within the genomic region. The genomic regions may be non-overlapping bins of equal length in the reference genome. In one embodiment, a number of tags that align to a bin can be counted. Thus, each bin can have a corresponding number of aligned tags. A histogram can be computed illustrating a frequency that bins have a certain number of aligned tags. Method 1200 may be performed for genomic regions each having a same length (e.g., 1 Mb bins), where the regions are non-overlapping. In other embodiments, different lengths can be used, which may be accounted for, and the regions may overlap.

At block 1240, the respective amount is normalized to obtain a respective density. In one embodiment, normalizing the respective amount to obtain a respective density includes using the same total number of aligned reference tags to determine the respective density and the reference density. In another embodiment, the respective amount can be divided by a total number of aligned reference tags.

At block 1250, the respective density is compared to a reference density to identify whether the genomic region exhibits a 1-copy loss or a 1-copy gain. In one embodiment, a difference is computed between the respective density and the reference density (e.g., as part of determining a z-score) and compared to a cutoff value. In various embodiments, the reference density can be obtained from a sample of healthy cells (e.g., from the buffy coat) or from the respective amounts themselves (e.g., by taking a median or average value, under an assumption that most regions do not exhibit a loss or a gain).

At block 1260, a first density is calculated from one or more respective densities identified as exhibiting a 1-copy loss or from one or more respective densities identified as exhibiting a 1-copy gain. The first density can correspond to just one genomic region, or may be determined from densities of multiple genomic regions. For example, the first density may be computed from respective densities having a 1-copy loss. The respective densities provide a measure of the amount of the density difference resulting from the deletion of the region in a tumor, given the tumor concentration. Similarly, if the first density is from respective densities having a 1-copy gain, then a measure of the amount of density difference resulting from the duplication of the region in a tumor can be obtained. Sections above describe various examples of how the densities of multiple regions can be used to determine an average density to be used for the first density.

At block 1270, the fractional concentration is calculated by comparing the first density to another density to obtain a differential. The differential is normalized with the reference density, which may be done in block 1270. For example, the differential can be normalized with the reference density by dividing the differential by the reference density. In another embodiment, the differential can be normalized in earlier blocks.

In one implementation, the other density is the reference density, e.g., as in section 2 above. Thus, calculating the fractional concentration may include multiplying the differential by two. In another implementation, the other density is a second density calculated from respective densities identified as exhibiting a 1-copy loss (where the first density is calculated using respective densities identified as exhibiting a 1-copy gain), e.g., as described in section 1 above. In this case, the normalized differential can be determined by computing a first ratio (e.g., R_(right)) of the first density and the reference density and computing a second ratio (R_(left)) of the second density and the reference density, where the differential is between the first ratio and the second ratio. As described above, the identification of genomic regions exhibiting a 1-copy loss or a 1-copy gain can be performed by fitting peaks to a distribution curve of a histogram of the respective densities.
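Blocks 1240 through 1270 can be sketched end to end under simplifying assumptions. Here the reference density is taken as the median of the densities themselves (one of the options named for block 1250), every bin beyond a fixed cutoff is treated as a single-copy change rather than being assigned by peak fitting, and the tag counts, cutoff and function names are hypothetical.

```python
from statistics import mean, median


def classify_bins(counts, cutoff):
    """Blocks 1240-1250: normalize per-bin tag counts to densities and flag
    bins whose density differs from the reference by more than `cutoff`."""
    total = sum(counts)
    densities = [c / total for c in counts]
    reference = median(densities)  # assumes most bins carry no aberration
    gains = [d for d in densities if d - reference > cutoff]
    losses = [d for d in densities if reference - d > cutoff]
    return reference, gains, losses


def fraction_vs_reference(first_density, reference):
    """Block 1270, first implementation: differential against the reference
    density, normalized by the reference and multiplied by two."""
    return abs(first_density - reference) / reference * 2


def fraction_gain_vs_loss(gain_density, loss_density, reference):
    """Block 1270, second implementation: difference between the two
    ratios (R_right - R_left) taken against the reference density."""
    return gain_density / reference - loss_density / reference


# Hypothetical tag counts for ten equal-length bins.
counts = [1000] * 8 + [1100, 900]
reference, gains, losses = classify_bins(counts, cutoff=0.005)
first = mean(gains)     # block 1260: first density from 1-copy-gain bins
second = mean(losses)   # second density from 1-copy-loss bins
print(fraction_vs_reference(first, reference))
print(fraction_gain_vs_loss(first, second, reference))
```

With these hypothetical counts, both implementations give an estimate of about 20%.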

In summary, embodiments can analyze the genomic representation of plasma DNA in different chromosomal regions to simultaneously determine if the chromosomal region is amplified or deleted in the tumor tissue and, if the region is amplified or deleted, to use its genomic representation to deduce the fractional concentration of the tumor-derived DNA. Some implementations use a normal mixture model to analyze the overall distribution of the genomic representation of different regions so as to determine the genomic representation associated with different types of aberrations, namely gains of 1, 2, 3 or 4 copies and losses of 1 or 2 copies.

Embodiments have several advantages over other methods, for example the genomewide aggregated allelic loss (GAAL) approach (U.S. patent application Ser. No. 13/308,473; Chan K C et al. Clin Chem 2013; 59:211-24) and the analysis of tumor-associated single nucleotide mutations (Forshew T et al. Sci Transl Med. 2012; 4:136ra68). All sequence reads mapping to regions with chromosomal aberrations can be used to determine the sequence read density of the region and, hence, are informative regarding the fractional concentration of tumoral DNA. On the other hand, in GAAL analysis, only sequence reads covering single nucleotides that are heterozygous in the individual and located within a chromosomal region with chromosome gain or loss would be informative. Similarly, for the analysis of cancer-associated mutations, only sequence reads covering the mutations would be useful for the deduction of the tumoral DNA concentration. Therefore, embodiments can allow a more cost-effective use of the sequencing data, as relatively fewer sequencing reads may be needed to achieve the same degree of accuracy in the estimation of fractional concentration of tumor-derived DNA when compared with other approaches.

X. Alternative Methodologies

Apart from using the number of times that a particular mutation is seen on a sequence tag as a criterion for identifying a locus as being a true mutation (thereby adjusting the positive predictive value), one could employ other techniques instead of or in addition to using a cutoff value to provide greater predictive value in identifying a cancerous mutation. For example, one could use bioinformatics filters of different stringencies when processing the sequencing data, e.g., by taking into account the quality score of a sequenced nucleotide. In one embodiment, one could use DNA sequencers and sequencing chemistries with different sequencing error profiles. Sequencers and chemistries with lower sequencing error rates would give higher positive predictive values. One can also use repeated sequencing of the same DNA fragment to increase the sequencing accuracy. One possible strategy is the circular consensus sequencing strategy of Pacific Biosciences.

In another embodiment, one could incorporate size information on the sequenced fragments into the interpretation of the data. As tumor-derived DNA is shorter than the non-tumor-derived DNA in plasma (see U.S. patent application Ser. No. 13/308,473), the positive predictive value of a shorter plasma DNA fragment containing a potential tumor-derived mutation will be higher than that of a longer plasma DNA fragment. The size data will be readily available if one performs paired-end sequencing of the plasma DNA. As an alternative, one could use DNA sequencers with long read lengths, thus yielding the complete length of a plasma DNA fragment. One could also perform size fractionation of the plasma DNA sample prior to DNA sequencing. Examples of methods that one could use for size fractionation include gel electrophoresis, microfluidics approaches (e.g. the Caliper LabChip XT system) and size-exclusion spin columns.

In yet another embodiment, the fractional concentration of tumor-associated mutations in plasma in a patient with non-hematologic cancer would be expected to increase if one focuses on the shorter DNA fragments in plasma. In one implementation, one can compare the fractional concentration of tumor-associated mutations in plasma in DNA fragments of two or more different size distributions. A patient with a non-hematologic cancer will have higher fractional concentrations of tumor-associated mutations in the shorter fragments when compared with the larger fragments.

In some embodiments, one could combine the sequencing results from two or more aliquots of the same blood sample, or from two or more blood samples taken on the same occasion or on different occasions. Potential mutations seen in more than one aliquot or sample would have a higher positive predictive value of tumor-associated mutations. The positive predictive value would increase with the number of samples that show such a mutation.

The potential mutations that are present in plasma samples taken at different time points can be regarded as potential mutations.

XI. Examples

The following are example techniques and data, which should not be considered limiting on embodiments of the present invention.

A. Materials And Methods

Regarding sample collection, hepatocellular carcinoma (HCC) patients, carriers of chronic hepatitis B, and a patient with synchronous breast and ovarian cancers were recruited. All HCC patients had Barcelona Clinic Liver Cancer stage A1 disease. Peripheral blood samples from all participants were collected into EDTA-containing tubes. The tumor tissues of the HCC patients were obtained during their cancer resection surgeries.

Peripheral blood samples were centrifuged at 1,600 g for 10 min at 4° C. The plasma portion was recentrifuged at 16,000 g for 10 min at 4° C. and then stored at −80° C. Cell-free DNA molecules from 4.8 mL of plasma were extracted according to the blood and body fluid protocol of the QIAamp DSP DNA Blood Mini Kit (Qiagen). The plasma DNA was concentrated with a SpeedVac Concentrator (Savant DNA 120; Thermo Scientific) into a 40-μl final volume per case for subsequent preparation of the DNA-sequencing library.

Genomic DNA was extracted from patients' buffy coat samples according to the blood and body fluid protocol of the QIAamp DSP DNA Blood Mini Kit. DNA was extracted from tumor tissues with the QIAamp DNA Mini Kit (Qiagen).

Sequencing libraries of the genomic DNA samples were constructed with the Paired-End Sample Preparation Kit (Illumina) according to the manufacturer's instructions. In brief, 1-5 micrograms of genomic DNA was first sheared with a Covaris S220 Focused-ultrasonicator to 200-bp fragments. Afterward, DNA molecules were end-repaired with T4 DNA polymerase and Klenow polymerase; T4 polynucleotide kinase was then used to phosphorylate the 5′ ends. A 3′ overhang was created with a 3′-to-5′ exonuclease-deficient Klenow fragment. Illumina adapter oligonucleotides were ligated to the sticky ends. The adapter-ligated DNA was enriched with a 12-cycle PCR. Because the plasma DNA molecules were short fragments and the amounts of total DNA in the plasma samples were relatively small, we omitted the fragmentation steps and used a 15-cycle PCR when constructing the DNA libraries from the plasma samples.

An Agilent 2100 Bioanalyzer (Agilent Technologies) was used to check the quality and size of the adapter-ligated DNA libraries. DNA libraries were then measured by a KAPA Library Quantification Kit (Kapa Biosystems) according to the manufacturer's instructions. The DNA library was diluted and hybridized to the paired-end sequencing flow cells. DNA clusters were generated on a cBot cluster generation system (Illumina) with the TruSeq PE Cluster Generation Kit v2 (Illumina), followed by 51×2 cycles or 76×2 cycles of sequencing on a HiSeq 2000 system (Illumina) with the TruSeq SBS Kit v2 (Illumina).

The paired-end sequencing data were analyzed by means of the Short Oligonucleotide Alignment Program 2 (SOAP2) in the paired-end mode. For each paired-end read, 50 bp or 75 bp from each end was aligned to the non-repeat-masked reference human genome (hg18). Up to 2 nucleotide mismatches were allowed for the alignment of each end. The genomic coordinates of these potential alignments for the 2 ends were then analyzed to determine whether any combination would allow the 2 ends to be aligned to the same chromosome with the correct orientation, spanning an insert size less than or equal to 600 bp, and mapping to a single location in the reference human genome. Duplicated reads were defined as paired-end reads in which the insert DNA molecule showed identical start and end locations in the human genome; the duplicate reads were removed as previously described (Lo et al. Sci Transl Med 2010; 2: 61ra91).
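The pairing criteria and duplicate definition described above can be illustrated with a minimal Python sketch. The `PairedAlignment` record and its field names are hypothetical and are not part of SOAP2; only the thresholds (2 mismatches per end, 600-bp insert, single location) come from the text, and the inclusive-coordinate convention is an assumption.

```python
from typing import Iterable, List, NamedTuple

MAX_MISMATCHES_PER_END = 2   # from the text
MAX_INSERT_SIZE_BP = 600     # from the text


class PairedAlignment(NamedTuple):
    """Hypothetical record for one candidate alignment of a read pair."""
    chrom_end1: str
    chrom_end2: str
    insert_start: int          # leftmost genomic coordinate of the insert
    insert_end: int            # rightmost genomic coordinate (assumed inclusive)
    mismatches_end1: int
    mismatches_end2: int
    correct_orientation: bool
    n_locations: int           # number of genomic placements found for the pair


def is_acceptable(p: PairedAlignment) -> bool:
    """Apply the pairing criteria: same chromosome, correct orientation,
    at most 2 mismatches per end, insert <= 600 bp, single location."""
    insert_size = p.insert_end - p.insert_start + 1
    return (
        p.chrom_end1 == p.chrom_end2
        and p.correct_orientation
        and p.mismatches_end1 <= MAX_MISMATCHES_PER_END
        and p.mismatches_end2 <= MAX_MISMATCHES_PER_END
        and 0 < insert_size <= MAX_INSERT_SIZE_BP
        and p.n_locations == 1
    )


def remove_duplicates(pairs: Iterable[PairedAlignment]) -> List[PairedAlignment]:
    """Keep the first pair seen for each (chromosome, start, end) of the insert."""
    seen = set()
    kept = []
    for p in pairs:
        key = (p.chrom_end1, p.insert_start, p.insert_end)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```

Keeping the first pair of each duplicate group is an arbitrary choice made for this sketch; the cited procedure may select the retained read differently.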

In some embodiments, the paired tumor and constitutional DNA samples were sequenced to identify the tumor-associated single nucleotide variants (SNVs). In some implementations, we focused on the SNVs occurring at homozygous sites in the constitutional DNA (in this example being the buffy coat DNA). In principle, any nucleotide variation detected in the sequencing data of the tumor tissues but absent in the constitutional DNA could be a potential mutation (i.e., a SNV). Because of sequencing errors (0.1%-0.3% of sequenced nucleotides), however, millions of false positives would be identified in the genome if a single occurrence of any nucleotide change in the sequencing data of the tumor tissue were to be regarded as a tumor-associated SNV. One way to reduce the number of false positives would be to institute the criterion of observing multiple occurrences of the same nucleotide change in the sequencing data in the tumor tissue before a tumor-associated SNV would be called.

Because the occurrence of sequencing errors is a stochastic process, the number of false positives due to sequencing errors would decrease exponentially with the increasing number of occurrences required for an observed SNV to be qualified as a tumor-associated SNV. On the other hand, the number of false positives would increase with increasing sequencing depth. These relationships could be predicted with Poisson and binomial distribution functions. Embodiments can determine a dynamic cutoff of occurrence for qualifying an observed SNV as tumor associated. Embodiments can take into account the actual coverage of the particular nucleotide in the tumor sequencing data, the sequencing error rate, the maximum false-positive rate allowed, and the desired sensitivity for mutation detection.
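As a rough illustration of how such a dynamic cutoff could be derived, the following Python sketch models erroneous reads at a locus with a Poisson distribution. Several choices here are assumptions made for illustration and are not taken from the text: a sequencing error is taken to be equally likely to produce any of the three alternative bases, the per-locus false-positive probability is approximated by a union bound over those three bases, and the desired sensitivity is not modeled. The function names are hypothetical.

```python
import math


def poisson_tail(k: int, lam: float, n_terms: int = 200) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed term by term.

    Summing the tail directly avoids the cancellation in 1 - CDF for very
    small probabilities. The fixed number of terms is adequate for the
    small lam values used here, not for large lam.
    """
    if lam <= 0:
        return 0.0 if k > 0 else 1.0
    return sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k, k + n_terms)
    )


def false_positive_probability(depth: int, cutoff: int, error_rate: float) -> float:
    """Approximate probability that errors alone yield >= cutoff reads
    showing the same nucleotide change at one locus."""
    lam = depth * error_rate / 3.0          # expected errors to one specific base
    return min(1.0, 3.0 * poisson_tail(cutoff, lam))


def dynamic_cutoff(depth: int, error_rate: float = 0.003,
                   max_fp_rate: float = 1e-7) -> int:
    """Smallest number of occurrences whose false-positive probability
    does not exceed max_fp_rate at the given depth."""
    for r in range(1, depth + 1):
        if false_positive_probability(depth, r, error_rate) <= max_fp_rate:
            return r
    return depth + 1                         # no cutoff meets the target


if __name__ == "__main__":
    for depth in (20, 50, 100):
        print(depth, dynamic_cutoff(depth))
```

Under this model the required cutoff rises with depth, which mirrors the dependence of the cutoff on the coverage of each locus described above.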

In some examples, we set very stringent criteria to reduce false positives. For example, a mutation may be required to be completely absent in the constitutional DNA sequencing, and the sequencing depth for the particular nucleotide position had to be 20-fold. In some implementations, the cutoff of occurrence achieved a false-positive detection rate of less than 10⁻⁷. In some examples, we also filtered out SNVs that were within centromeric, telomeric, and low-complexity regions to minimize false positives due to alignment artifacts. In addition, putative SNVs mapping to known SNPs in the dbSNP build 135 database were also removed.

B. Before and after Resection

FIG. 13A shows a table 1300 of the analysis of mutations in the plasma of the patient with ovarian cancers and a breast cancer at the time of diagnosis according to embodiments of the present invention. Here, we demonstrate an example for a patient with bilateral ovarian cancers and a breast cancer. The sequencing data of the plasma were compared to the sequencing results of the constitutional DNA of the patient (buffy coat). Single nucleotide changes that were present in the plasma but not in the constitutional DNA were regarded as potential mutations. The ovarian cancers on the right and left side of the patient were each sampled at two sites, making a total of four tumor samples. The tumor mutations were mutations detected in all the four ovarian tumor tissues at four different sites.

Over 3.6 million single nucleotide changes were detected in the plasma at least once by sequencing. Among these changes, only 2,064 were also detected in the tumor tissues, giving a positive predictive value of 0.06%. Using the criterion of being detected at least two times in plasma, the number of potential mutations was significantly reduced by 99.5% to 18,885. The number of tumor mutations was only reduced by 3% to 2,003, and the positive predictive value increased to 11%.

Using the criterion of detection at least five times in plasma, only 2,572 potential mutations were detected and, amongst them, 1,814 were mutations detected in all the tumor tissues, thus giving a positive predictive value of 71%. Other criteria for the number of occurrences (e.g. 2, 3, 4, 6, 7, 8, 9, 10, etc.) can be used for defining potential mutations depending on the sensitivity and positive predictive value required. The higher the number of occurrences used as the criterion, the higher the positive predictive value would be, at the cost of reduced sensitivity.
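The positive predictive values quoted above follow directly from the reported counts, as the short Python sketch below shows. The count of potential mutations at the once-or-more threshold is given only as "over 3.6 million", so the first value is approximate; the relative-retention figure (tumor mutations kept compared with the once-or-more threshold) is an illustrative addition rather than a quantity reported in the text.

```python
# Minimum occurrences in plasma -> (potential mutations, of which tumor mutations)
COUNTS = {
    1: (3_600_000, 2_064),   # "over 3.6 million": approximate lower bound
    2: (18_885, 2_003),
    5: (2_572, 1_814),
}


def ppv(potential: int, confirmed: int) -> float:
    """Positive predictive value: confirmed tumor mutations / potential mutations."""
    return confirmed / potential


def summarize(counts):
    """Return {threshold: (PPV in %, % of tumor mutations retained vs threshold 1)}."""
    baseline = counts[1][1]
    return {
        k: (100 * ppv(potential, confirmed), 100 * confirmed / baseline)
        for k, (potential, confirmed) in counts.items()
    }


if __name__ == "__main__":
    for k, (ppv_pct, retained_pct) in summarize(COUNTS).items():
        print(f">= {k} occurrences: PPV {ppv_pct:.2f}%, retained {retained_pct:.1f}%")
```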

FIG. 13B shows a table 1350 of the analysis of mutations in the plasma of the patient with bilateral ovarian cancers and a breast cancer after tumor resection according to embodiments of the present invention. Surgical resection of the tumors was performed. A blood sample was taken one day after the resection of the ovarian tumors and the breast cancer. The plasma DNA was then sequenced. For this example, only the mutations from the ovarian cancers were analyzed. Over 3 million potential mutations were detected at least once in a plasma sample. However, using a criterion of having at least five occurrences, the number of potential mutations was reduced to 238. A significant reduction was observed when compared with the number of potential mutations for the sample taken at diagnosis and using the same criterion of five occurrences.

In one embodiment, the number of single nucleotide changes detected in plasma can be used as a parameter for the detection, monitoring and prognostication of a cancer patient. Different numbers of occurrences can be used as the criterion to achieve the desired sensitivity and specificity. A patient with a higher tumor load and thus worse prognosis will be expected to have a higher mutational load seen in plasma.

For such analysis, one could establish the mutational load profile for different types of cancer. For monitoring purposes, one would see that the mutational load in plasma of a patient who responds to treatment would decrease. If the tumor has recurred, e.g. during a relapse, then the mutational load will be expected to increase. Such monitoring would allow one to assess the efficacy of the selected modality of treatment for a patient and to detect the emergence of resistance to a particular treatment.

Through the analysis of the specific mutations that one could see in the plasma DNA sequencing results, one could also identify targets that would predict sensitivity (e.g. mutations in the epidermal growth factor receptor gene and response to tyrosine kinase inhibitor treatment) and resistance to particular targeted treatment (e.g. KRAS mutations in colorectal cancer and resistance to treatment by panitumumab and cetuximab), and could guide the planning of treatment regimes.

The example above was for the bilateral ovarian cancers. One could also perform the same analysis on the mutations of the breast cancer and then would be able to track the mutations of both of these cancer types in the plasma. One can also use a similar strategy to track the mutations of a primary cancer and its metastasis or metastases.

Embodiments would be useful for the screening of cancer in apparently healthy subjects or in subjects with particular risk factors (e.g. smoking status, viral status (such as hepatitis virus carriers, human papillomavirus infected subjects)). The mutational load that one could see in the plasma of such subjects would give a risk that the subject would develop symptomatic cancer within a particular timeframe. Thus, subjects with a higher mutational load in plasma would be expected to have a higher risk than those with a lower mutational load. Furthermore, the temporal profile of such mutational load in plasma would also be a powerful indicator of risk. For example, if a subject has the plasma mutational load measured once each year and if the mutational loads are progressively increasing, then this subject should be referred for additional screening modalities for cancer, e.g. using chest X ray, ultrasound, computed tomography, magnetic resonance imaging or positron emission tomography.

C. Dynamic Cutoffs to Deduce Mutations from Sequencing Plasma

Four patients with hepatocellular carcinoma (HCC) and one patient with ovarian and breast cancer were recruited for this study. For the latter patient, we focused on the analysis of the ovarian cancer. Blood samples were collected from each patient before and after surgical resection of the tumors. The resected tumor tissues were also collected. The DNA extracted from the tumor tissue, the white blood cells of the preoperative blood sample and the pre- and post-operative plasma samples was sequenced using the HiSeq2000 sequencing system (Illumina). The sequencing data were aligned to the reference human genome sequence (hg18) using Short Oligonucleotide Analysis Package 2 (SOAP2) (Li R et al. Bioinformatics 2009; 25: 1966-1967). The DNA sequences of the white blood cells were regarded as constitutional DNA sequence for each study subject.

In this example, tumor-associated SNMs were first deduced from the plasma DNA sequencing data and the CG without reference to the tumor tissues. Then, the deduced results from plasma were compared with sequencing data generated from the tumor tissues (as a gold standard) to ascertain the accuracy of the deduced results. In this regard, the gold standard was made by comparing sequencing data from the tumor tissues and the constitutional sequence to work out the mutations in the tumor tissues. In this analysis, we focused on nucleotide positions at which the constitutional DNA of the studied subject was homozygous.

1. Non-Targeted Whole Genome Analysis

The sequencing depths for the white cells, the tumor tissues and the plasma DNA of each patient are shown in Table 5.

TABLE 5 Median sequencing depths of different samples for the four HCC cases.

                          Median sequencing depth (folds)
Case                      White blood cells   Tumor tissue   Preoperative plasma   Postoperative plasma
HCC1                      39                  29             23                    24
HCC2                      39                  29             25                    28
HCC3                      46                  33             18                    21
HCC4                      46                  27             20                    23
Ovarian cancer patient    44                  53             37                    28

The dynamic cutoffs for the minimum occurrences for defining plasma mutations (r) as shown in table 1 are used for identifying the mutations in the plasma of each patient. As the sequencing depth of each locus may vary, the cutoff may vary, which effectively provides a dependence of the cutoff on the total number of reads for a locus. For example, although the median depth is less than 50 (Table 5), the sequencing depth of individual loci can vary considerably, and some loci can be covered more than 100 times.

In addition to sequencing errors, another source of error would be alignment errors. To minimize this type of error, the sequence reads carrying a mutation were realigned to the reference genome using the Bowtie alignment program (Langmead B et al. Genome Biol 2009, 10:R25). Only reads that could be aligned to a unique position of the reference genome by SOAP2 and Bowtie were used for the downstream analysis for plasma mutations. Other combinations of alignment software packages based on different algorithms could also be used.

In order to further minimize the sequencing and alignment errors in the actual sequencing data, we applied two additional filtering algorithms for calling the nucleotide positions which showed single nucleotide variations in the sequence reads: (1) ≥70% of the sequence reads carrying the mutation could be realigned to the same genomic coordinate using Bowtie with mapping quality ≥Q20 (i.e. misalignment probability <1%); (2) ≥70% of the sequence reads carrying the mutation did not have the mutation within 5 bp of either end (i.e. the 5′ or 3′ end) of the sequence read. The second filtering rule was instituted because sequencing errors were more prevalent at both ends of a sequence read.
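The two read-level filters can be sketched in Python as follows. The `MutantRead` record and its fields are hypothetical stand-ins for whatever the aligner output provides, and treating "within 5 bp of an end" as the first or last five bases of the read (0-based offsets) is an assumption of this sketch.

```python
from typing import NamedTuple, Sequence


class MutantRead(NamedTuple):
    """Hypothetical summary of one read carrying the candidate variant."""
    realigned_same_position: bool  # second aligner placed it at the same coordinate
    mapq: int                      # mapping quality from the second aligner
    offset_in_read: int            # 0-based position of the variant base in the read
    read_length: int


def passes_filters(reads: Sequence[MutantRead],
                   min_fraction: float = 0.70,
                   min_mapq: int = 20,
                   end_margin: int = 5) -> bool:
    """Return True if the candidate position satisfies both filters."""
    if not reads:
        return False
    n = len(reads)
    # Filter 1: reads realigned to the same coordinate with MAPQ >= Q20.
    realigned = sum(
        1 for r in reads if r.realigned_same_position and r.mapq >= min_mapq
    )
    # Filter 2: variant base not within end_margin bases of either read end.
    interior = sum(
        1 for r in reads
        if end_margin <= r.offset_in_read < r.read_length - end_margin
    )
    return realigned / n >= min_fraction and interior / n >= min_fraction
```

Both fractions are computed over all reads carrying the variant at the position, so a candidate supported mainly by read ends or by ambiguously realigned reads is rejected.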

We also investigated the factors affecting the deduction of tumor-associated mutations without prior knowledge of the tumor genome. One such parameter was the fractional concentration of tumor-derived DNA in plasma. This parameter could be regarded as another gold standard parameter and was deduced for reference purposes with prior knowledge of the tumor genome using GAAL.

Table 6 shows nucleotide variations detected in plasma before and after treatment. For HCC1, without prior knowledge of the tumor genome, a total of 961 single nucleotide variations were detected. Amongst these nucleotide variations detected in plasma, 828 were cancer-associated mutations. After surgical resection of the HCC, the total number of nucleotide variations was reduced to 43 and none of them were cancer-associated mutations.

For reference purposes, the fractional concentration of tumor-derived DNA in the pre-operative plasma sample was 53% and was deduced with prior knowledge of the tumor genome. For HCC2, HCC3 and HCC4, without prior knowledge of the tumor genomes, the numbers of single nucleotide variations in plasma were deduced as ranging from 27 to 32 for the pre-operative plasma samples. These results are compatible with the mathematical prediction that, with a sequencing depth of approximately 20-fold, a very low percentage of cancer-associated mutations could be detected in the plasma and most of the sequence variations detected in the plasma were due to sequencing errors. After tumor resection, there was no significant change in the number of sequence variations detected. For reference purposes, the fractional concentrations of tumor-derived DNA in plasma were deduced as ranging from 2.1% to 5% and were deduced with prior knowledge of the tumor genomes.

TABLE 6 Nucleotide variations detected in plasma.

Columns for each plasma sample: (a) fractional concentration of tumor-derived DNA; (b) total no. of single nucleotide variations; (c) no. of cancer-associated mutations identified.

                                       Pre-operative plasma       Post-operative plasma
Case                                   (a)      (b)     (c)       (a)      (b)     (c)
HCC1                                   53%      961     828       0.4%     43      0
HCC2                                   5%       32      0         0.6%     49      0
HCC3                                   2.1%     29      0         0.2%     32      0
HCC4                                   2.6%     27      0         1.3%     35      1
Ovarian (and breast) cancer patient    46%      1718    1502      0.2%     2       0

2. Target Enrichment of the Exons

As discussed above, increasing the sequencing depth for the region-of-interest can increase both the sensitivity and specificity for identifying cancer-associated mutations in plasma and, hence, increase the discrimination power between the cancer patients and non-cancer subjects. While the increase of sequencing depth for the whole genome is still very costly, one alternative is to enrich for certain regions for sequencing. In one embodiment, selected exons or indeed the whole exome can be target-enriched for sequencing. This approach can significantly increase the sequencing depth of the target region without increasing the total amount of sequence reads.

The sequencing libraries of the plasma DNA of the HCC patients and the patient with ovarian (and breast) cancer were captured using the Agilent SureSelect All Exon kit for target enrichment of the exome. The exon-enriched sequencing libraries were then sequenced using the HiSeq 2000 sequencing system. The sequence reads were aligned to the human reference genome (hg18). After alignment, sequence reads uniquely mapped to the exons were analyzed for single nucleotide variations. For the identification of single nucleotide variations in plasma for the exome capture analysis, the dynamic cutoff values shown in table 2 are used.

FIG. 14A is a table 1400 showing detection of single nucleotide variations in plasma DNA for HCC1. Without prior knowledge of the tumor genome, we deduced from the targeted sequencing data a total of 57 single nucleotide variations in plasma. In subsequent validation from the sequencing data obtained from the tumor tissues, 55 were found to be true tumor-associated mutations. As discussed before, the fractional concentration of tumor-derived DNA in the pre-operative plasma was 53%. After tumor resection, no single nucleotide variations were detected in the targeted sequencing data obtained from the plasma. These results indicate that the quantitative analysis of the number of single nucleotide variations in plasma can be used for monitoring the disease progression of cancer patients.

FIG. 14B is a table 1450 showing detection of single nucleotide variations in plasma DNA for HCC2. Without prior knowledge of the tumor genome, we deduced from the targeted sequencing data of the plasma a total of 18 single nucleotide variations. All of these mutations were found in the tumor tissues. As discussed before, the fractional concentration of tumor-derived DNA in the pre-operative plasma was 5%. After tumor resection, no single nucleotide variations were detected in the plasma. Compared with HCC1, which had a higher fractional concentration of tumor-derived DNA in plasma, fewer single nucleotide variations were detected in the plasma of the case involving HCC2. These results suggest that the number of single nucleotide variations in plasma can be used as a parameter to reflect the fractional concentration of tumor-derived DNA in plasma and, hence, the tumor load in the patient, as it has been shown that the concentration of tumor-derived DNA in plasma is positively correlated with the tumor load (Chan K C et al. Clin Chem 2005; 51:2192-5).

FIG. 15A is a table 1500 showing detection of single nucleotide variations in plasma DNA for HCC3. Without prior knowledge of the tumor genome, we did not observe from the targeted sequencing data any single nucleotide variations in either the pre- or post-resection plasma samples. This is likely to be due to the relatively low fractional concentration (2.1%) of tumor-derived DNA in plasma in this patient. Further increase in the sequencing depth is predicted to improve the sensitivity for detecting cancer-associated mutations in cases with low fractional concentration of tumor-derived DNA.

FIG. 15B is a table 1550 showing detection of single nucleotide variations in plasma DNA for HCC4. Without prior knowledge of the tumor genome, we deduced from the targeted sequencing data of the plasma a total of 3 single nucleotide variations. All of these mutations were found in the tumor tissues. Compared with HCC1 and HCC2, which had higher fractional concentrations of tumor-derived DNA in plasma, fewer single nucleotide variations were detected in the plasma of case HCC4, which had a fractional concentration of tumor DNA in plasma of 2.6%. These results suggest that the number of single nucleotide variations in plasma can be used as a parameter to reflect the fractional concentration of tumor-derived DNA in plasma and tumor load in a patient.

FIG. 16 is a table 1600 showing detection of single nucleotide variations in plasma DNA for the patient with ovarian (and breast) cancer. Without prior knowledge of the tumor genome, we deduced from the targeted sequencing data of the plasma a total of 64 single nucleotide variations. Amongst these, 59 were found in the ovarian tumor tissues. The estimated fractional concentration of ovarian tumor-derived DNA in the plasma was 46%. A significant reduction in the total number of single nucleotide variations detected in plasma was observed after resection of the ovarian cancer.

In addition to the use of the SureSelect target enrichment system (Agilent), we also used the Nimblegen SeqCap EZ Exome+UTR target enrichment system (Roche) for enriching sequences from exons for sequencing. The Nimblegen SeqCap system covers the exon regions of the genome as well as the 5′ and 3′ untranslated regions. The pre-treatment plasma samples of the four HCC patients, two healthy control subjects and two chronic hepatitis B carriers without a cancer were analyzed (Table 7). In other embodiments, other target enrichment systems, including but not limited to those using solution phase or solid phase hybridization, can be used.

TABLE 7 Exome sequencing results for the four HCC patients (HCC1-4) using the Nimblegen SeqCap EZ Exome + UTR target enrichment system for sequence capture. The sequencing analysis of the pre-treatment plasma of HCC3 was sub-optimal due to a higher percentage of PCR-duplicated reads.

Columns: (a) fractional concentration of tumor-derived DNA in plasma by GAAL analysis; (b) no. of sequence variations detected in plasma fulfilling the dynamic cutoffs; (c) no. of sequence variations that overlap with mutations detected in the corresponding tumor tissue.

                  Pre-treatment plasma    Post-treatment plasma
Case     (a)      (b)      (c)            (b)      (c)
HCC1     53%      69       64             1        1
HCC2     5%       51       47             3        0
HCC3     2.1%     0        0              1        0
HCC4     2.6%     8        7              0        0

In the two chronic hepatitis B carriers and the two healthy control subjects, one or fewer single nucleotide variations that fulfilled the dynamic cutoff criteria were detected (Table 8). In three of the four HCC patients, the number of sequence variations detected in plasma that fulfilled the dynamic cutoff requirement was at least 8. In HCC3, no SNV that fulfilled the dynamic cutoff was detected. In this sample, there was a high proportion of PCR-duplicated reads in the sequenced reads, leading to a lower number of non-duplicated sequenced reads. Marked reduction of SNVs detectable in plasma was observed after surgical resection of the tumor.

TABLE 8 Exome sequencing results for 2 chronic hepatitis B carriers (HBV1 and HBV2) and 2 healthy control subjects (Ctrl1 and Ctrl2) using the Nimblegen SeqCap EZ Exome + UTR target enrichment system for sequence capture.

Subject    No. of sequence variations detected in plasma fulfilling the dynamic cutoffs
HBV1       0
HBV2       1
Ctrl1      1
Ctrl2      1

XII. Tumor Heterogeneity

The quantification of single nucleotide mutations in a biological sample (e.g., plasma/serum) is also useful for the analysis of tumor heterogeneity, both intra-tumoral and inter-tumoral heterogeneity. Intra-tumoral heterogeneity relates to the existence of multiple clones of tumor cells within the same tumor. Inter-tumoral heterogeneity relates to the existence of multiple clones of tumor cells for two or more tumors of the same histologic type, but present in different sites (either in the same organs, or in different organs). In certain types of tumors, the existence of tumoral heterogeneity is a poor prognostic indicator (Yoon H H et al. J Clin Oncol 2012; 30: 3932-3938; Merlo L M F et al. Cancer Prev Res 2010; 3: 1388-1397). In certain types of tumors, the higher the degree of tumoral heterogeneity, the higher would be the chance of tumor progression or the development of resistant clones following targeted treatment.

Although cancers are believed to arise from the clonal expansion of one tumor cell, the growth and evolution of a cancer would lead to the accumulation of new and different mutations in different parts of a cancer. For example, when a cancer patient develops metastasis, the tumor located at the original organ and the metastatic tumor would share a number of mutations. However, the cancer cells of the two sites would also carry a unique set of mutations that are absent in the other tumor site. The mutations that are shared by the two sites are expected to be present at higher concentrations than those mutations that are only observed in one tumor site.

A. Example

We analyzed the blood plasma of a patient who had bilateral ovarian cancers and a breast cancer. Both ovarian tumors were serous adenocarcinoma. The left one measured 6 cm and the right one measured 12 cm in the longest dimension. There were also multiple metastatic lesions at the colon and the omentum. The DNA extracted from the leukocytes was sequenced using the sequencing-by-synthesis platform from Illumina to an average of 44-fold haploid genome coverage. The nucleotide locations showing only one allele, i.e. homozygous, were analyzed further for single nucleotide mutations in plasma.

DNA was extracted from four different sites of the left and right tumors and was sequenced using the Illumina sequencing platform. Two sites (sites A and B) were from the right tumor and the other two sites (sites C and D) were from the left tumor. Sites A and B were approximately 4 cm apart. The distance between sites C and D was also approximately 4 cm. Plasma samples were collected from the patient before and after surgical resection of the ovarian tumors. DNA was then extracted from the plasma of the patient. The sequencing depths of the tumor from sites A, B, C and D, as well as the plasma samples, are shown in Table 9.

TABLE 9 Sequencing depth of the tumor from sites A, B, C and D.

Sample                                No. of raw sequencing reads   No. of aligned reads   Folds of haploid genome coverage
Constitutional DNA from buffy coat    1,091,250,072                 876,269,922            43.81
Right ovarian tumor (site A)          1,374,495,256                 1,067,277,229          53.36
Right ovarian tumor (site B)          934,518,588                   803,007,464            40.15
Left ovarian tumor (site C)           1,313,051,122                 1,036,643,946          51.83
Left ovarian tumor (site D)           1,159,091,833                 974,823,207            48.74
Plasma sample collected before        988,697,457                   741,982,535            37.10
Plasma sample collected after         957,295,879                   564,623,127            28.23

In the current example, for defining a tumor-associated single nucleotide mutation, the nucleotide location is sequenced at least 20 times in the tumor tissue and 30 times in the constitutional DNA. In other embodiments, other sequencing depths can be used, e.g. 35, 40, 45, 50, 60, 70, 80, 90, 100 and >100 folds. The reduction of sequencing costs would allow increased depths to be performed much more readily. The nucleotide position is homozygous in the constitutional DNA whereas a nucleotide change is observed in the tumor tissue. The criterion for the occurrence of the nucleotide change in the tumor tissue is dependent on the total sequencing depth of the particular nucleotide position in the tumor tissue. For nucleotide coverage from 20 to 30 folds, the occurrence of the nucleotide change (cutoff value) is at least five times. For coverage from 31 to 50 folds, the occurrence of the nucleotide change is at least six times. For coverage from 51 to 70 folds, the occurrence needs to be at least seven times. These criteria are derived from the prediction of sensitivity of detecting the true mutations and the expected number of false positive loci using the Poisson distribution.
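The depth-dependent criteria of this example can be written as a small lookup, sketched below in Python. The function names are hypothetical. Returning `None` for coverage above 70 folds reflects that no cutoff is stated for that range in this example, and treating "homozygous in the constitutional DNA" as zero reads showing the change is a simplifying assumption.

```python
from typing import Optional


def tumor_cutoff(tumor_depth: int) -> Optional[int]:
    """Minimum occurrences of a nucleotide change in tumor tissue,
    given the tumor sequencing depth at that position."""
    if tumor_depth < 20:
        return None          # below the minimum depth required in this example
    if tumor_depth <= 30:
        return 5
    if tumor_depth <= 50:
        return 6
    if tumor_depth <= 70:
        return 7
    return None              # no cutoff stated for >70 folds in this example


def is_tumor_mutation(tumor_depth: int, tumor_alt_reads: int,
                      const_depth: int, const_alt_reads: int) -> bool:
    """Apply the example's criteria to one nucleotide position."""
    cutoff = tumor_cutoff(tumor_depth)
    if cutoff is None:
        return False
    return (
        const_depth >= 30            # constitutional DNA sequenced at least 30 times
        and const_alt_reads == 0     # assumed: change absent from constitutional DNA
        and tumor_alt_reads >= cutoff
    )
```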

FIG. 17 is a table 1700 showing the predicted sensitivities of different requirements of occurrence and sequencing depths. The sensitivity would correspond to the number of true mutations detected at a particular fold depth using a particular cutoff. The higher the sequencing depth, the more likely it is for a mutation to be detected for a given cutoff, as more mutation sequence reads will be obtained. For higher cutoff values, a mutation is less likely to be detected, since the criteria are more stringent.

FIG. 18 is a table 1800 showing the predicted numbers of false positive loci for different cutoffs and different sequencing depths. The number of false positives increases with increasing sequencing depth, as more sequence reads are obtained. However, no false positives are predicted for a cutoff of five or more, even up to a sequencing depth of 70. In other embodiments, different criteria of occurrence can be used so as to achieve the desired sensitivity and specificity.

FIG. 19 shows a tree diagram illustrating the number of mutations detected in the different tumor sites. The mutations were determined by sequencing the tumors directly. Site A has 71 mutations that are specific to that tumor, and site B has 122 site-specific mutations, even though they were only 4 cm apart. 10 mutations were seen in both sites A and B. Site C has 168 mutations that are specific to that tumor, and site D has 248 site-specific mutations, even though they were only 4 cm apart. 12 mutations were seen in both sites C and D. There is significant heterogeneity in the mutational profiles for the different tumor sites. For example, 248 mutations were only detected in the site D tumor but not detected in the other three tumor sites. A total of 2,129 mutations were seen in all four sites. Thus, many mutations were shared among the different tumors. In all, there were seven SNV groups (A, B, C, D, AB, CD and ABCD). There were no observable differences among these four regions in terms of copy number aberrations.

FIG. 20 is a table 2000 showing the number of fragments carrying the tumor-derived mutations in the pre-treatment and post-treatment plasma samples. The inferred fractional concentrations of tumor-derived DNA carrying the respective mutations are also shown. The category of mutation refers to the tumor site(s) where the mutations were detected. For example, category A mutations refer to mutations only present in site A whereas category ABCD mutations refer to mutations present in all the four tumor sites.

For the 2,129 mutations that were present in all four tumor sites, 2,105 (98.9%) were detectable in at least one plasma DNA fragment. On the other hand, for the 609 mutations that were present in only one of the four tumor sites, only 77 (12.6%) were detectable in at least one plasma DNA fragment. Therefore, the quantification of single nucleotide mutations in plasma can be used for reflecting the relative abundance of these mutations in the tumor tissues. This information would be useful for the study of the cancer heterogeneity. In this example, a potential mutation was called when it had been seen once in the sequencing data.
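The figures in this comparison can be cross-checked with a few lines of Python using only counts stated in the text: the site-specific mutation counts for sites A to D (71, 122, 168 and 248) sum to the 609 single-site mutations, and the two detection rates follow from the reported numerators. The variable names are illustrative.

```python
# Site-specific mutation counts reported for the four tumor sites.
SITE_SPECIFIC = {"A": 71, "B": 122, "C": 168, "D": 248}

SHARED_ALL_SITES = 2_129      # mutations present in all four sites
SHARED_DETECTED = 2_105       # of which seen in at least one plasma fragment
SINGLE_SITE_DETECTED = 77     # single-site mutations seen in plasma


def detection_rate(detected: int, total: int) -> float:
    """Percentage of mutations detectable in at least one plasma DNA fragment."""
    return 100.0 * detected / total


single_site_total = sum(SITE_SPECIFIC.values())
shared_rate = detection_rate(SHARED_DETECTED, SHARED_ALL_SITES)
single_rate = detection_rate(SINGLE_SITE_DETECTED, single_site_total)

if __name__ == "__main__":
    print(f"single-site mutations: {single_site_total}")
    print(f"shared by all sites:   {shared_rate:.1f}% detected in plasma")
    print(f"single-site:           {single_rate:.1f}% detected in plasma")
```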

The fractional concentrations of circulating tumor DNA were determined with each SNV group. The fractional concentrations of tumor DNA in plasma before surgery and after surgery, as determined by SNVs shared by all 4 regions (i.e., group ABCD), were 46% and 0.18%, respectively. These latter percentages correlated well with those obtained in GAAL analyses, 46% and 0.66%. Mutations that were shared by all 4 regions (i.e., group ABCD) made the highest fractional contribution of tumor-derived DNA to the plasma.

The fractional concentrations of tumor-derived DNA in preoperative plasma determined with SNVs from groups AB and CD were 9.5% and 1.1%, respectively. These concentrations were consistent with the relative sizes of the right and left ovarian tumors. The fractional concentrations of tumor-derived DNA determined with the region-unique SNVs (i.e., those in groups A, B, C, and D) were generally low. These data suggest that for an accurate measurement of the total tumor load in a cancer patient, the use of a genomewide shotgun approach might provide a more representative picture, compared with the more traditional approach of targeting specific tumor-associated mutations. For the latter approach, if only a subset of the tumor cells possesses the targeted mutations, one might miss important information regarding imminent relapse or disease progression caused by tumor cells not possessing the targeted mutations, or one might miss the emergence of a treatment-resistant clone.

FIG. 21 is a graph 2100 showing distributions of occurrence in plasma for the mutations detected in a single tumor site and mutations detected in all four tumor sites. The bar graph 2100 shows data for two types of mutation: (1) mutations detected in only one site and (2) mutations detected in all four tumor sites. The horizontal axis is the number of times that a mutation is detected in the plasma. The vertical axis shows the percentage of mutations that correspond to a particular value on the horizontal axis. For example, about 88% of type (1) mutations showed up only once in the plasma. As shown, the mutations present in only one site were mostly detected once, and not more than four times. The mutations present in a single tumor site were much less frequently detected in the plasma compared with the mutations present in all four tumor sites.

One application of this technology would be to allow clinicians to estimate the load of tumor cells carrying the different classes of mutations. A proportion of these mutations would potentially be treatable with targeted agents. Agents targeting mutations carried by a higher proportion of tumor cells would be expected to have a more prominent therapeutic effect.

FIG. 22 is a graph 2200 showing predicted distribution of occurrence in plasma for mutations coming from a heterogeneous tumor. The tumor contains two groups of mutations. One group of mutations is present in all tumor cells and the other group of mutations is only present in ¼ of the tumor cells, based on an approximation that two sites are representative of each ovarian tumor. The total fractional concentration of tumor-derived DNA in plasma is assumed to be 40%. The plasma sample is assumed to be sequenced to an average depth of 50 times per nucleotide position. According to this predicted distribution of occurrence in plasma, the mutations that are present in all tumor tissues can be differentiated from the mutations only present in ¼ of the tumor cells by their occurrence in plasma. For example, the occurrence of 6 times can be used as a cutoff. For the mutations present in all tumor cells, 92.3% of the mutations would be present in the plasma at least 6 times. In contrast, for the mutations that are present in ¼ of the tumor cells, only 12.4% of mutations would be present in the plasma at least 6 times.
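The kind of prediction described for FIG. 22 can be illustrated with a simple Poisson model. The following is a minimal sketch, assuming that the expected number of variant reads at a mutated locus is the sequencing depth times the fractional concentration divided by two (the form given for M_P in the claims), further scaled by the fraction of tumor cells carrying the mutation. The function names and the scaling by clonal fraction are assumptions made for illustration, and this simplified model is not guaranteed to reproduce the exact percentages quoted above, which depend on modeling details not restated here.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing exactly k events for a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_least(cutoff, mean):
    """Probability of observing at least `cutoff` variant reads."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(cutoff))


def expected_variant_reads(depth, tumor_fraction, clonal_fraction=1.0):
    """Expected variant reads, assuming one mutated allele out of two
    (hence the division by 2) and that only `clonal_fraction` of the
    tumor cells carry the mutation (an illustrative assumption)."""
    return depth * tumor_fraction * clonal_fraction / 2.0


# Values taken from the FIG. 22 scenario: depth 50, 40% tumor DNA, cutoff 6.
shared_mean = expected_variant_reads(50, 0.40)          # mutations in all tumor cells
quarter_mean = expected_variant_reads(50, 0.40, 0.25)   # mutations in 1/4 of tumor cells

p_shared = prob_at_least(6, shared_mean)
p_quarter = prob_at_least(6, quarter_mean)

print(f"shared mutations seen >= 6 times: {p_shared:.3f}")
print(f"quarter-clone mutations seen >= 6 times: {p_quarter:.3f}")
```

Under this sketch, the shared mutations are far more likely than the subclonal mutations to reach the cutoff, which is the qualitative separation the figure is described as showing.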

FIG. 23 is a table 2300 demonstrating the specificity of embodiments for 16 healthy control subjects. Their plasma DNA samples were sequenced to a median coverage of 30-fold. Detection of the mutations that were present in the plasma of the above ovarian cancer patient was performed in the plasma samples of these healthy subjects. The mutations present in the tumor of the ovarian cancer patient were very infrequently detected in the sequencing data of the plasma of the healthy control subjects, and none of the categories of mutations had an apparent fractional concentration of >1%. These results show that this detection method is highly specific.

B. Method

FIG. 24 is a flowchart of a method 2400 for analyzing a heterogeneity of one or more tumors of a subject according to embodiments of the present invention. Certain steps of method 2400 may be performed as described herein.

At block 2410, a constitutional genome of the subject is obtained. At block 2420, one or more sequence tags are received for each of a plurality of DNA fragments in a biological sample of the subject, where the biological sample includes cell-free DNA. At block 2430, genomic positions are determined for the sequence tags. At block 2440, the sequence tags are compared to the constitutional genome to determine a first number of first loci. At each first locus, a number of the sequence tags having a sequence variant relative to the constitutional genome is above a cutoff value, where the cutoff value is greater than one.
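The comparison at block 2440 can be sketched as follows. This is a minimal illustration, assuming that each sequence tag has already been reduced to a (genomic position, observed base) pair and that the constitutional genome is available as a mapping from position to the set of constitutional alleles; the function name and data layout are hypothetical and are not taken from the described embodiments.

```python
from collections import Counter


def find_first_loci(tags, constitutional, cutoff=2):
    """Return {position: variant_read_count} for loci whose count of
    variant tags is above `cutoff`.

    tags           -- iterable of (position, base) pairs (assumed layout)
    constitutional -- dict mapping position -> set of constitutional bases
    cutoff         -- must be greater than one, as stated for block 2440
    """
    if cutoff <= 1:
        raise ValueError("cutoff must be greater than one")

    variant_counts = Counter()
    for position, base in tags:
        alleles = constitutional.get(position)
        # Count a tag only if the position is known and the base differs
        # from every constitutional allele at that position.
        if alleles is not None and base not in alleles:
            variant_counts[position] += 1

    # Keep only loci with a variant count above the cutoff.
    return {pos: n for pos, n in variant_counts.items() if n > cutoff}


# Toy data: position 100 has 3 variant reads, position 200 has a single
# variant read (likely a sequencing error), position 300 is heterozygous.
example_tags = [(100, "A")] * 3 + [(100, "G")] * 10 + [(200, "T")] + [(300, "C")] * 5
example_constitutional = {100: {"G"}, 200: {"C"}, 300: {"C", "T"}}

first_loci = find_first_loci(example_tags, example_constitutional, cutoff=2)
print(first_loci)
```

Requiring more than one variant tag per locus discards the isolated mismatch at position 200, which is the false-positive filtering role the cutoff plays in the method.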

At block 2450, a measure of heterogeneity of the one or more tumors is calculated based on the respective first numbers of the set of first genomic locations. In one aspect, the measure can provide a value that represents a number of mutations that are shared by tumors relative to a number of mutations that are not shared by tumors. Here, various tumors can exist as a single object, with different tumors within the object, which may represent what is normally called intra-tumor heterogeneity. The measure can also relate to whether some mutations are in one or a few tumors compared to mutations that are in many or most tumors. More than one measure of heterogeneity can be calculated.

At block 2460, the heterogeneity measure can be compared to a threshold value to determine a classification of a level of heterogeneity. The one or more measures can be used in various ways. For example, one or more heterogeneity measures can be used to predict the chance of tumor progression. In some tumors, the greater the heterogeneity, the higher the chance of progression and the higher the chance of emergence of a resistant clone following treatment (e.g., targeted treatment).

C. Tumor Heterogeneity Measures

One example of a heterogeneity measure is the number of ‘concentration bands’ of different groups of mutations in plasma. For example, if there are two predominant tumor clones within a patient, and if these clones are present in different concentrations, then we would expect to see two different groups of mutations with different concentrations in plasma. These different values can be computed by determining the fractional concentration for different sets of mutations, where each set corresponds to one of the tumors.

Each of these concentrations can be called a ‘concentration band’ or ‘concentration class’. If a patient has more clones, then more concentration bands/classes will be seen. Thus, the more bands, the more heterogeneous the tumors. The number of concentration bands can be seen by plotting the fractional concentrations for various mutations. A histogram can be made for the various concentrations, where different peaks correspond to different tumors (or different clones of one tumor). A large peak will likely be for mutations that are shared by all or some tumors (or clones of a tumor). These peaks may be analyzed to determine which smaller peaks are combined to determine a larger peak. A fitting procedure may be used, e.g., similar to the fitting procedure for FIGS. 10B and 11.

In one implementation, the histogram is a plot with the y-axis being the amount (e.g., number or proportion) of loci and the x-axis being the fractional concentration. Mutations that are shared by all or some tumors would result in a higher fractional concentration. The peak size would represent the amount of loci that give rise to a particular fractional concentration. The relative size of the peaks at low and high concentration would reflect the degree of heterogeneity of the tumors (or clones of a tumor). A larger peak at the high concentration reflects that most mutations are shared by most or all tumors (or clones of a tumor) and indicates a lower degree of tumor heterogeneity. If the peak at the low concentration is larger, then most mutations are shared by a few tumors (or a few clones of a tumor). This would indicate a higher degree of tumor heterogeneity.
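The histogram of concentration bands can be sketched as follows. This is a minimal illustration, assuming that a fractional concentration (between 0 and 1) has already been computed for each mutated locus; the bin width, the function names, and the simple local-maximum peak rule are assumptions chosen for illustration and stand in for, rather than reproduce, the fitting procedure mentioned above.

```python
def concentration_histogram(concentrations, bin_width=0.05):
    """Count loci per fractional-concentration bin over the range [0, 1]."""
    n_bins = int(round(1.0 / bin_width))
    counts = [0] * n_bins
    for c in concentrations:
        index = min(int(c / bin_width), n_bins - 1)  # clamp 1.0 into the last bin
        counts[index] += 1
    return counts


def find_peaks(counts):
    """Return indices of non-empty bins that are local maxima.

    A bin is a peak if it is strictly higher than its left neighbor and
    at least as high as its right neighbor (so a flat top counts once).
    """
    peaks = []
    for i, n in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < len(counts) - 1 else 0
        if n > 0 and n > left and n >= right:
            peaks.append(i)
    return peaks


# Toy per-locus fractional concentrations forming three bands.
example_concentrations = [0.02, 0.03, 0.04, 0.11, 0.12, 0.46, 0.47, 0.46, 0.48]

counts = concentration_histogram(example_concentrations)
peaks = find_peaks(counts)
print("number of concentration bands:", len(peaks))
print("peak sizes (low to high concentration):", [counts[i] for i in peaks])
```

In this toy example the largest peak sits at the highest concentration, which under the interpretation above would point toward most mutations being shared and hence a lower degree of heterogeneity.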

The more peaks that exist, the more site-specific mutations there are. Each peak can correspond to a different set of mutations, where the set of mutations is from a subset of the tumors (e.g., just one or two tumors, as illustrated above). For the example of FIG. 19, there might be a total of 7 peaks, with the 4 site-only peaks likely having the smallest concentration (depending on the relative size of the tumors), two peaks for AB sites and CD sites, and a peak for mutations shared by all sites.

The location of the peaks can also provide a relative size of the tumors. A larger concentration would correlate to a larger tumor, as a larger tumor would release more tumor DNA into the sample, e.g., into plasma. Thus, one could estimate the load of tumor cells carrying the different classes of mutations.

Another example of a heterogeneity measure is the proportion of mutation sites having relatively few variant reads (e.g., 4, 5, or 6) compared to the proportion of mutation sites having relatively many variant reads (e.g., 9-13). Referring back to FIG. 22, one can see that the site-specific mutations had fewer variant reads (which also results in a smaller fractional concentration). The shared mutations have more variant reads (which also results in a larger fractional concentration). A ratio of a first proportion at 6 (smaller count) divided by a second proportion at 10 (larger count) conveys a heterogeneity measure. If the ratio is small, then there are few mutations that are site-specific, and thus the level of heterogeneity is low. If the ratio is large (or at least larger than values calibrated from known specimens), then the level of heterogeneity is higher.
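The ratio-based measure can be sketched as follows. This is a minimal illustration, assuming that the number of variant reads has been tallied for each mutated locus; the function name, the default counts of 6 and 10 (taken from the example above), and the choice to return infinity when no locus has the larger count are assumptions made for illustration.

```python
from collections import Counter


def read_count_ratio(variant_read_counts, low_count=6, high_count=10):
    """Ratio of the proportion of loci with `low_count` variant reads to
    the proportion of loci with `high_count` variant reads."""
    total = len(variant_read_counts)
    if total == 0:
        raise ValueError("no loci supplied")

    tally = Counter(variant_read_counts)
    low_proportion = tally[low_count] / total
    high_proportion = tally[high_count] / total

    if high_proportion == 0:
        # No loci at the larger count: treat the ratio as unbounded.
        return float("inf")
    return low_proportion / high_proportion


# Toy data: 3 loci with 6 variant reads, 6 loci with 10, 1 locus with 12.
example_counts = [6, 6, 6, 10, 10, 10, 10, 10, 10, 12]
ratio = read_count_ratio(example_counts)
print(f"heterogeneity ratio: {ratio:.2f}")
```

A small ratio in this sketch corresponds to few site-specific mutations relative to shared ones; what counts as small or large would come from calibration against known specimens rather than from this code.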

D. Determining Thresholds

The threshold values can be determined from subjects whose tumors are biopsied (e.g., as described above) to directly determine a level of heterogeneity. The level may be defined in various ways, such as ratios of site-specific mutations to shared mutations. Biological samples (e.g., plasma samples) can then be analyzed to determine heterogeneity measures, where a heterogeneity measure from the biological samples can be associated with the level of heterogeneity determined by analyzing the cells of the tumors directly.

Such a procedure can provide a calibration of thresholds relative to heterogeneity levels. If the test heterogeneity measure falls between two thresholds, then the level of heterogeneity can be estimated as being between the levels corresponding to the thresholds.
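Comparing a test measure against calibrated thresholds can be sketched as follows. This is a minimal illustration, assuming that the thresholds are sorted in ascending order and that one more level label than thresholds is supplied; the function name, the threshold values, and the level labels are hypothetical placeholders and are not calibrated values from any specimen.

```python
from bisect import bisect_right


def classify_heterogeneity(measure, thresholds, levels):
    """Return the level whose threshold interval contains `measure`.

    thresholds -- ascending list of N cut points
    levels     -- list of N + 1 labels, one per interval
    """
    if len(levels) != len(thresholds) + 1:
        raise ValueError("need exactly one more level than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in ascending order")
    # bisect_right gives the number of thresholds that are <= measure,
    # which is the index of the interval the measure falls into.
    return levels[bisect_right(thresholds, measure)]


# Hypothetical thresholds and labels, for illustration only.
example_thresholds = [0.5, 2.0]
example_levels = ["low", "intermediate", "high"]

print(classify_heterogeneity(1.0, example_thresholds, example_levels))
```

A measure falling between two thresholds is assigned the level lying between the levels those thresholds separate, mirroring the estimate described above.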

In one embodiment, a calibration curve can be calculated between the heterogeneity levels determined from the biopsies and the corresponding heterogeneity measure determined from the plasma sample (or other sample). In such an example, the heterogeneity levels are numeric, where these numeric levels can correspond to different classifications. Different ranges of numeric levels can correspond to different diagnoses, e.g., different stages of cancer.

E. Method Using Fractional Concentration from Genomic Representation

Tumor heterogeneity can also be analyzed using the fractional concentration, e.g., as determined using embodiments of method 1200. The genomic regions that exhibit one copy loss might come from different tumors. Thus, the fractional concentration determined for various genomic regions might differ depending on whether the amplification (or deletion for 1-copy loss) exists in just one tumor or multiple tumors. Thus, the same heterogeneity measures may be used for fractional concentrations determined via embodiments of method 1200.

For example, one genomic region can be identified as corresponding to a 1-copy loss, and a fractional concentration can be determined just from a respective density at that genomic region (the respective density could be used as a fractional concentration). A histogram can be determined from the various respective densities by counting the number of regions having various densities. If only one tumor or one tumor clone or one tumor deposit had a gain in a particular region, then the density of that region would be less than the density in a region that had a gain in multiple tumors or multiple tumor clones or multiple tumor deposits (i.e., the fractional concentration of tumor DNA in the shared region would be larger than the site-specific region). The heterogeneity measures described above can thus be applied to peaks identified using the copy number gain or loss in various regions, just as the fractional concentration of different sites showed a distribution of fractional concentrations.

In one implementation, if the respective densities are used for the histogram, one would have gains and losses separated. The regions showing a gain could be analyzed separately by creating a histogram just for gains, and a separate histogram can be created just for losses. If the fractional concentration is used, then the peaks of losses and gains can be analyzed together. For example, the fractional concentrations use a difference (e.g., as an absolute value) to the reference density, and thus the fractional concentrations for gains and losses can contribute to the same peak.
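The two ways of handling gains and losses can be sketched as follows. This is a minimal illustration, assuming that each region has a density normalized against a reference density and that each aberration is a single-copy change in an otherwise two-copy genome, so that the fractional concentration is taken here as twice the absolute relative difference from the reference. That formula, the function names, and the example values are assumptions made for illustration and are not a restatement of method 1200.

```python
def fractional_concentration(density, reference_density):
    """Fractional concentration from the absolute difference to the
    reference density, assuming a one-copy gain or loss (assumption)."""
    if reference_density <= 0:
        raise ValueError("reference density must be positive")
    return 2.0 * abs(density - reference_density) / reference_density


def split_gains_and_losses(region_densities, reference_density):
    """Separate region densities into gains and losses so that each group
    can be given its own histogram."""
    gains = [d for d in region_densities if d > reference_density]
    losses = [d for d in region_densities if d < reference_density]
    return gains, losses


# Toy densities relative to a reference density of 1.0.
example_densities = [0.80, 0.95, 1.05, 1.20]
reference = 1.0

gains, losses = split_gains_and_losses(example_densities, reference)
fractions = [fractional_concentration(d, reference) for d in example_densities]

print("gains:", gains, "losses:", losses)
print("fractional concentrations:", [round(f, 2) for f in fractions])
```

Because the absolute difference is used, the loss at 0.80 and the gain at 1.20 yield the same fractional concentration in this sketch and would therefore fall into the same histogram peak.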

XIII. Computer System

Any of the computer systems mentioned herein may utilize any suitable number of subsystems. Examples of such subsystems are shown in FIG. 25 in computer apparatus 2500. In some embodiments, a computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus. In other embodiments, a computer system can include multiple computer apparatuses, each being a subsystem, with internal components.

The subsystems shown in FIG. 25 are interconnected via a system bus 2575. Additional subsystems such as a printer 2574, keyboard 2578, fixed disk 2579, monitor 2576, which is coupled to display adapter 2582, and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 2571, can be connected to the computer system by any number of means known in the art, such as serial port 2577. For example, serial port 2577 or external interface 2581 (e.g. Ethernet, Wi-Fi, etc.) can be used to connect computer system 2500 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus 2575 allows the central processor 2573 to communicate with each subsystem and to control the execution of instructions from system memory 2572 or the fixed disk 2579, as well as the exchange of information between subsystems. The system memory 2572 and/or the fixed disk 2579 may embody a computer readable medium. Any of the values mentioned herein can be output from one component to another component and can be output to the user.

A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 2581 or by an internal interface. In some embodiments, computer systems, subsystems, or apparatuses can communicate over a network. In such instances, one computer can be considered a client and another computer a server, where each can be part of a same computer system. A client and a server can each include multiple systems, subsystems, or components.

It should be understood that any of the embodiments of the present invention can be implemented in the form of control logic using hardware (e.g. an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner. As used herein, a processor includes a multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement embodiments of the present invention using hardware and a combination of hardware and software.

Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C++ or Perl using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. Suitable media include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard-drive or a floppy disk, or an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like. The computer readable medium may be any combination of such storage or transmission devices.

Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet. As such, a computer readable medium according to an embodiment of the present invention may be created using a data signal encoded with such programs. Computer readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer program product (e.g. a hard drive, a CD, or an entire computer system), and may be present on or within different computer program products within a system or network. A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.

Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Thus, embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or in a different order. Additionally, portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, circuits, or other means for performing these steps.

The specific details of particular embodiments may be combined in any suitable manner without departing from the spirit and scope of embodiments of the invention. However, other embodiments of the invention may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.

The above description of exemplary embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the teaching above. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications to thereby enable others skilled in the art to best utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated.

A recitation of “a”, “an” or “the” is intended to mean “one or more” unless specifically indicated to the contrary.

All patents, patent applications, publications, and descriptions mentioned here are incorporated by reference in their entirety for all purposes. None is admitted to be prior art.

What is claimed is:
 1. A method for detecting cancer or premalignant change in a subject, the method comprising: obtaining a constitutional genome of the subject; receiving one or more sequence tags for each of a plurality of DNA fragments in a biological sample of the subject, the biological sample including cell-free DNA; determining genomic positions for the sequence tags; comparing the sequence tags to the constitutional genome to determine a first number of first loci, wherein: at each of the first loci, a number of the sequence tags having a sequence variant relative to the constitutional genome is above a cutoff value, the cutoff value being greater than one; determining a parameter based on a count of sequence tags having a sequence variant at the first loci; and comparing the parameter to a threshold value to determine a classification of a level of cancer in the subject.
 2. The method of claim 1, wherein the threshold value is determined from one or more samples from one or more other subjects.
 3. The method of claim 1, wherein the cutoff value for a locus is dependent on a total number of sequence tags that have a genomic position at the locus.
 4. The method of claim 1, wherein different cutoff values are used for at least two of the first loci.
 5. The method of claim 4, further comprising: dynamically determining a first cutoff value for one of the first loci, the one of the first loci residing within a first region.
 6. The method of claim 5, wherein the first cutoff value is determined based on a sequencing depth of the one of the first loci.
 7. The method of claim 5, wherein the first cutoff value is determined based on a false positive rate that depends on a sequencing error rate, a sequencing depth of the first region, and a number of nucleotide positions in the first region.
 8. The method of claim 7, wherein the first cutoff value is determined based on a number of true positives in the first region.
 9. The method of claim 8, further comprising: calculating the number of true positives for the first cutoff value based on the sequencing depth D of the first region and a fractional concentration f of tumor-derived DNA in the biological sample.
 10. The method of claim 9, wherein calculating the number of true positives uses a Poisson distribution probability according to the formula: $Pb = 1 - \sum_{i=0}^{r-1} \mathrm{Poisson}(i, M_P)$, where Pb is a probability for detecting true positives, and r is the first cutoff value, and $M_P = D \times f / 2$.
 11. The method of claim 5, wherein the first cutoff value is determined using any one of the following criteria: if a sequencing depth is less than 50 then the first cutoff value is 5, if the sequencing depth is 50-110 then the first cutoff value is 6, if the sequencing depth is 111-200 then the first cutoff value is 7, if the sequencing depth is 201-310 then the first cutoff value is 8, if the sequencing depth is 311-450 then the first cutoff value is 9, if the sequencing depth is 451-620 then the first cutoff value is 10, and if the sequencing depth is 621-800 then the first cutoff value is 11.
 12. The method of claim 1, wherein the parameter is a weighted sum of the first number of first loci, wherein a contribution of each of the first loci is weighted based on an importance value assigned to the respective first loci.
 13. The method of claim 1, wherein the parameter includes a sum of the sequence tags indicating a sequence variant at the first number of first loci.
 14. The method of claim 13, wherein the sum is a weighted sum, and wherein one of the first loci has a first weight that is different than a second weight of a second of the first loci.
 15. The method of claim 14, wherein the first weight is greater than the second weight, and wherein the one of the first loci is associated with cancer, and the second of the first loci is not associated with cancer.
 16. The method of claim 1, wherein the parameter is the first number of first loci.
 17. The method of claim 1, wherein determining a genomic position for a sequence tag includes: aligning at least a portion of the sequence tags to a reference genome, wherein the alignment of a sequence tag allows for one or more mismatches between the sequence tag and the reference genome.
 18. The method of claim 17, wherein comparing the sequence tags to the constitutional genome includes: comparing the constitutional genome to the reference genome to determine a second number of second loci having a variant relative to the reference genome; based on the aligning, determining a third number of third loci, wherein: at each of the third loci, a number of the sequence tags having a sequence variant relative to the reference genome is above a cutoff value; and taking a difference of the third number and the second number to obtain the first number of first loci.
 19. The method of claim 18, wherein taking the difference of the third number and the second number identifies the first loci.
 20. The method of claim 19, wherein determining the parameter includes: for each locus of the first number of first loci: counting sequence tags that align to the locus and have a sequence variant at the locus; and determining the parameter based on the respective counts.
 21. The method of claim 1, wherein the constitutional genome is derived from a constitutional sample from the subject that contains more than 50% constitutional DNA.
 22. The method of claim 1, wherein determining a genomic position for a sequence tag includes: aligning at least a portion of the sequence tags to the constitutional genome, wherein the alignment of a sequence tag allows for one or more mismatches between the sequence tag and the constitutional genome.
 23. The method of claim 22, wherein comparing the sequence tags to the constitutional genome includes: based on the aligning, identifying sequence tags that have a sequence variant at a genomic location relative to the constitutional genome of the subject; for each genomic location of a plurality of genomic locations exhibiting a sequence variant: counting a respective number of sequence tags that align to the genomic location and have a sequence variant at the genomic location; and determining a parameter based on the respective numbers.
 24. The method of claim 23, wherein determining the parameter based on the respective numbers includes: summing the respective numbers to obtain a first sum; and using the first sum to determine the parameter.
 25. The method of claim 24, wherein using the first sum to determine the parameter includes: subtracting the number of genomic locations exhibiting a sequence variant from the first sum.
 26. The method of claim 24, wherein using the first sum to determine the parameter includes: normalizing the first sum based on an amount of sequence tags aligned.
 27. The method of claim 1, further comprising: obtaining a constitutional sample of the subject that contains more than 90% constitutional DNA; performing random sequencing of DNA fragments in the constitutional sample to obtain one or more second sequence tags for each of a plurality of DNA fragments in the constitutional sample; aligning at least a portion of the second sequence tags to a reference genome, wherein the alignment of a second sequence tag allows for a mismatch between the second sequence tag and the reference genome at M or less genomic locations, wherein M is an integer equal to or greater than one; and constructing the constitutional genome based on the second sequence tags and the aligning.
 28. The method of claim 27, wherein the constitutional sample is the biological sample, and wherein constructing the constitutional genome includes: determining a consensus sequence that includes a determination of a homozygous locus or a heterozygous locus having two alleles; and using the consensus sequence at the constitutional genome.
 29. The method of claim 1, wherein the one or more sequence tags are generated from a random sequencing of DNA fragments in the biological sample.
 30. The method of claim 29, further comprising: receiving the biological sample of the subject; and performing the random sequencing of DNA fragments in the biological sample to generate the one or more sequence tags for each of a plurality of DNA fragments in the biological sample.
 31. The method of claim 1, wherein the biological sample is urine, pleural fluid, ascitic fluid, peritoneal fluid, saliva, cerebrospinal fluid, or a stool sample.
 32. The method of claim 1, wherein the parameter is a fractional concentration of tumor-derived DNA.
 33. A computer product comprising a non-transitory computer readable medium storing a plurality of instructions for detecting cancer or premalignant change in a subject that when executed control a computer system to perform: obtaining a constitutional genome of the subject; receiving one or more sequence tags for each of a plurality of DNA fragments in a biological sample of the subject, the biological sample including cell-free DNA; determining genomic positions for the sequence tags; comparing the sequence tags to the constitutional genome to determine a first number of first loci, wherein: at each of the first loci, a number of the sequence tags having a sequence variant relative to the constitutional genome is above a cutoff value, the cutoff value being greater than one; determining a parameter based on a count of sequence tags having a sequence variant at the first loci; and comparing the parameter to a threshold value to determine a classification of a level of cancer in the subject.